Replies: 2 comments 11 replies
-
Great point! My view is that we should support them in the code & let anyone add them (with points). For the leaderboard later, I think we could have one "crosslingual" tab in addition to per-language tabs. Note that for STS we already have crosslingual datasets in the code & LB, but currently they are just in the …
-
English->XX or XX->English is already a good base for cross-lingual tasks. STS supports this, but it would be nice to extend it to Retrieval, Reranking and maybe Summarization (not sure about existing datasets here) tasks. For XX -> YY (where XX != English), mlqa is also a good resource. @orionw should we just add new tasks inheriting from the current AbsTasks, or should we think about adding something to mark that it is a cross-lingual task (inheriting from …
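One way to sketch the second option: a small mixin that records the language pair and flags the task as cross-lingual, which existing task classes could inherit alongside their current base. This is purely illustrative; `CrossLingualTaskMixin`, `query_lang`, `doc_lang`, and the stub base class are hypothetical names, not MTEB's actual API.

```python
# Hypothetical sketch, not MTEB's real API: a mixin that marks a task
# as cross-lingual by carrying its query/document language pair, so a
# leaderboard could group En->XX, XX->En, and XX->YY tasks separately.

class CrossLingualTaskMixin:
    """Marks a task as cross-lingual and records its language pair."""
    query_lang: str = "en"
    doc_lang: str = "en"

    @property
    def is_crosslingual(self) -> bool:
        # Cross-lingual iff the query and document languages differ.
        return self.query_lang != self.doc_lang


class AbsTaskRetrievalStub:
    """Stand-in for an existing base class (e.g. an AbsTask subclass)."""
    pass


class EnDeRetrieval(CrossLingualTaskMixin, AbsTaskRetrievalStub):
    query_lang = "en"
    doc_lang = "de"


print(EnDeRetrieval().is_crosslingual)  # True
```

The advantage over plain inheritance from the current AbsTasks is that the flag is queryable at runtime, so crosslingual tasks could be filtered into their own tab without a separate class hierarchy per task type.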
-
(started from a discussion in #347 on cross-lingual datasets with @KennethEnevoldsen and @izhx. Also tagging @Muennighoff)
Should we include cross-lingual datasets in MMTEB (such as xPCA, xQA, xOR-TyDiQA)?
Where I'm located (near Washington DC) people care a lot about cross-lingual, but only for English->XX (where we have an English query and we're looking for documents in other languages). There are a decent number of resources for these English-centric cross-lingual tasks, as listed above.
However, given that this benchmark is a worldwide effort, it might not make sense to focus on only En->XX or XX->En. On the other hand, given the number of potential cross-lingual categories, it would be difficult to get a thorough benchmark for the cross-product of languages, as I doubt that existing resources cover many of these pairs. However, we could include what we can find in a separate cross-lingual category.
What are people's thoughts on this? I might lean towards not supporting it in the initial version of MMTEB, but I don't have a strong preference.