TL;DR We provide parallel corpora for eight under-represented languages in the Middle East along with scripts to fine-tune NLLB. The resources are open-source. Please consider these languages in your projects!
The Middle East is characterized by remarkable linguistic diversity, with over 500 million inhabitants speaking more than 60 languages across multiple language families. In a first attempt of its kind, we create parallel corpora for the following eight under-represented languages of the Middle East:
- Luri Bakhtiari (`bqi_Arab`)
- Gilaki (`glk_Arab`)
- Hawrami (`hac_Arab`)
- Laki (`lki_Arab`)
- Mazanderani (`mzn_Arab`)
- Southern Kurdish (`sdh_Arab`)
- Talysh (`tly_Arab`)
- Zazaki (`zza_Arab`)
This repository provides documentation for our project, which aims to develop machine translation for low-resource languages in the Middle East. The fine-tuned model is available on Hugging Face at https://huggingface.co/SinaAhmadi/NLLB-DOLMA.
The project would not have been possible without the enthusiasm and passion of over 50 volunteers who translated over 36,000 sentences within a period of a few months. You are free to use everything, but please be mindful to acknowledge the project if you use it. Read the license.
All the parallel corpora are provided in the corpora folder. These are TSV files containing parallel sentences in three languages: English-Farsi-X. Exceptionally for Zazaki, the test set contains only English-Zazaki pairs and the remaining data is English-Kurmanji-Zazaki. These files are the most comprehensive ones, with the following metadata:
- `en_sentence`: sentence in English
- `fa_sentence`: sentence in Farsi
- `kmr_sentence`: only for Zazaki, sentence in Kurmanji (Northern Kurdish)
- `translation`: sentence in the target language
- `variety`: variety (dialect)
- `county`: approximate region where the translator comes from
- `orthography`: orthography
- `translator`: translator ID (for internal tracking)
When processing these files, extract only the columns you need, as not all the metadata fields are useful for MT; see the sketch below.
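For instance, a minimal sketch with pandas (the filename `corpora/hac.tsv` is hypothetical; check the corpora folder for the actual file names):

```python
import pandas as pd

# Hypothetical filename; check the corpora folder for the actual names.
df = pd.read_csv("corpora/hac.tsv", sep="\t")

# Keep only the columns needed for MT and drop the remaining metadata
# (variety, county, orthography, translator).
pairs = df[["en_sentence", "fa_sentence", "translation"]].dropna()
print(pairs.head())
```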
Some of the translated sentences are aligned with the same source sentences in English or Farsi, creating a multilingual parallel corpus ideal for cross-lingual studies. We have merged all sentences across languages based on the English and Farsi source sentences. This multilingual corpus is available at corpora/common_sentences.tsv. To retrieve the metadata, look up the sentences in the original corpora, for instance as sketched below.
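A rough sketch of such a lookup (both the file names and the `glk_sentence` column name are hypothetical; inspect `common_sentences.tsv` for the actual headers):

```python
import pandas as pd

# Hypothetical filenames and column names; adjust to the actual files.
common = pd.read_csv("corpora/common_sentences.tsv", sep="\t")
glk = pd.read_csv("corpora/glk.tsv", sep="\t")

# Recover the metadata of one Gilaki sentence by matching it
# against the `translation` column of the original corpus.
sentence = common.iloc[0]["glk_sentence"]  # hypothetical column name
row = glk[glk["translation"] == sentence]
print(row[["variety", "county", "orthography"]])
```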
The number of parallel sentences shared across languages is provided in the following coverage matrix. For instance, Laki has 437 and 849 sentences aligned with Gilaki and Talysh, respectively.
If you are looking for dataset splits, check out the datasets folder, where the sentences of the parallel corpora are split into test, validation, and train sets. To create these splits, we follow a set of rules explained in detail in the paper. We highly recommend using these splits, as a plain 10/20/70 random split would probably not represent the dialects and orthographies fairly.
All the dataset splits are also available in `jsonlines` format, ready for training/fine-tuning with Hugging Face. These files are provided in the fine-tune folder. The `base` and `augmented` sub-folders refer to the data setups reported in the paper. Within each folder, you can also find the prefixed sentences.
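The JSONL splits can be loaded with the `datasets` library; a minimal sketch, assuming hypothetical file names under `fine-tune/base/`:

```python
from datasets import load_dataset

# Hypothetical file names inside fine-tune/base/; check the folder
# for the actual split files.
dataset = load_dataset(
    "json",
    data_files={
        "train": "fine-tune/base/train.json",
        "validation": "fine-tune/base/validation.json",
    },
)
print(dataset["train"][0])
```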
The datasets used in the ablation studies are provided in the ablation and samples folders.
To summarize, these are the available corpora and datasets per language:
Language | Parallel languages | # Sentence pairs | Varieties | Orthographies | Download | Splits | JSONL |
---|---|---|---|---|---|---|---|
Luri Bakhtiari (BQI) | English, Farsi | 1998 | Central | Pāpêrik | corpus | test / val | val |
Gilaki (GLK) | English, Farsi | 5420 | Eastern, Western | Vrg, Sarkhat, Other | corpus | test / val / train | train / val |
Hawrami (HAC) | English, Farsi | 7794 | Lhon, Jawaru, Hawraman Takht | Kurdish (two variants) | corpus | test / val / train | train / val |
Laki Kurdish (LKI) | English, Farsi | 3418 | Kakavandi, Jalalvan, Hozmanvan, Sahneyi | Kurdish | corpus | test / val / train | train / val |
Mazanderani (MZN) | English, Farsi | 4345 | Central | Farsi | corpus | test / val / train | train / val |
Southern Kurdish (SDH) | English, Farsi | 9797 | Pehley, Garusi, Kalhori, Kirmashani, Badrei | Kurdish | corpus | test / val / train | train / val |
Talysh (TLY) | English, Farsi | 2106 | Southern | Farsi (diacritized) | corpus | test / val | val |
Zazaki (ZZA) | English, Kurmanji | 4401 | Northern, Central, Southern | Kurdish | en-zza corpus / en-kmr corpus | test / train / val | train / val |
Although it's not the main contribution of the project, we release all the scripts used for preparing the corpora, the data splits, and fine-tuning. Please note that the code is not optimized and you might need to adjust the file paths (it might be easier to simply work with the datasets and the corpora, to be honest!). Additional scripts, including those for visualization, are provided in the utils folder.
- `create_corpus.py`: implements the semantic and string-based similarity measures described in the paper to ensure that a diverse set of sentences is extracted from the corpus. Make sure to update `codes/data.json` by specifying the directory of your files.
- `codes/extract_sentences.py`: if you have a monolingual corpus, use this script to extract sentences for translation into a high-resource language.
- `codes/random_sampler.py`: randomly selects sentences from a monolingual corpus.
- `nllb_prepare_data.py`: prepares datasets for NLLB fine-tuning with both the base and augmented configurations.
- `combine_data.py`: merges all the individual JSONL files into one.
- `prepend_lang_code.py`: prepends the language indicator token to the beginning of each sentence (see the sketch after this list).
- `sampler_size.py`: samples from the datasets by incrementally selecting 100 sentences per language. Commands to train models on these samples are provided in `codes/train_samples.sh`.
- `sampler_exclusive.py`: creates samples of 1000 sentences, leaving out the data of one language each time.
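For illustration, a toy version of the prefixing idea (not the actual `prepend_lang_code.py`): each sentence is prefixed with its language indicator token so the model knows which language it belongs to.

```python
# A toy re-implementation of the idea, not the actual script:
# prefixing each sentence with its language indicator token tells
# the model which language the text is in.
def prepend_lang_code(sentence: str, lang_code: str) -> str:
    return f"{lang_code} {sentence}"

print(prepend_lang_code("...", "zza_Arab"))  # -> "zza_Arab ..."
```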
- `fine-tune.py`: initializes NLLB (600M distilled) by adding new token indicators for our selected languages (a sketch of this step follows the list).
- `run_translation.py`: a modified version of Hugging Face's fine-tuning code; the main difference concerns tokenization, where we remove the source and target language tokens as arguments.
- For more information on fine-tuning NLLB, check this and this.
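A minimal sketch of the initialization step using the standard Hugging Face API (an assumption about the mechanism; see `fine-tune.py` for the exact procedure): the new language codes are added as special tokens and the embedding matrix is resized to match.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# A sketch of the initialization step, assuming the standard
# Hugging Face API; see fine-tune.py for the exact procedure.
checkpoint = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Register the new language indicator tokens and grow the
# embedding matrix accordingly.
new_lang_codes = ["bqi_Arab", "glk_Arab", "hac_Arab", "lki_Arab",
                  "mzn_Arab", "sdh_Arab", "tly_Arab", "zza_Arab"]
tokenizer.add_tokens(new_lang_codes, special_tokens=True)
model.resize_token_embeddings(len(tokenizer))
```

Resizing the embeddings initializes the new rows randomly; fine-tuning then learns useful representations for the new language tokens.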
- `evaluate-zero-shot.py`: zero-shot evaluation of NLLB.
- `scorer.py`: calculates BLEU and chrF scores on the output of the zero-shot evaluation (see the scoring sketch after this list).
- `models_evaluate.py`: evaluates the fine-tuned models.
- `pes-eng-bleu.py`: calculates BLEU and chrF on the translation of the second reference (in Farsi or Kurmanji).
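A minimal sketch of BLEU and chrF scoring with sacrebleu, using hypothetical inputs (`scorer.py` may differ in its details):

```python
import sacrebleu

# Hypothetical outputs and references, for illustration only.
hypotheses = ["a translated sentence produced by the model"]
references = [["the reference translation of the same sentence"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}  chrF: {chrf.score:.2f}")
```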
- The translation guides provided to the contributors are available in English, Central Kurdish and Farsi.
- The outputs of the baseline and fine-tuned models are provided in experiments.
This project is fully open-source under the extremely permissive MIT license. Please be mindful of the effort that has gone into it!
This project was carried out during my employment as a postdoc at the University of Zurich. It also received generous support from the amazing SILICON initiative at Stanford University. Additionally, we mobilized a community of over 50 wonderful volunteers who participated in the translation initiative, shared their content, or simply spread the word on social media. By making everything fully open-source, we hope that more researchers, in both academia and industry, will consider working on these under-represented languages. We also hope the parallel corpora will be crawled and included in the training of LLMs in the future.
Any support to sustain this initiative, as well as research collaborations to expand these resources, is welcome. For collaboration inquiries, don't hesitate to reach out.
If you're using this project, please cite this paper:
@inproceedings{ahmadi2025memt,
title = {{PARME}: Parallel Corpora for Low-Resourced {Middle Eastern} Languages},
author = {
Ahmadi, Sina and
Sennrich, Rico and
Karami, Erfan and
Marani, Ako and
Fekrazad, Parviz and
Akbarzadeh Baghban, Gholamreza and
Hadi, Hanah and
Heidari, Semko and
Dogan, Mahîr and
Asadi, Pedram and
Bashir, Dashne and
Ghodrati, Mohammad Amin and
Amini, Kourosh and
Ashourinezhad, Zeynab and
Baladi, Mana and
Ezzati, Farshid and
Ghasemifar, Alireza and
Hosseinpour, Daryoush and
Abbaszadeh, Behrooz and
Hassanpour, Amin and
Jalal Hamaamin, Bahaddin and
Kamal Hama, Saya and
Mousavi, Ardeshir and
Nazir Hussein, Sarko and
Nejadgholi, Isar and
Ölmez, Mehmet and
Osmanpour, Horam and
Roshan Ramezani, Rashid and
Sediq Aziz, Aryan and
Salehi Sheikhalikelayeh, Ali and
Yadegari, Mohammadreza and
Yadegari, Kewyar and
Zamani Roodsari, Sedighe
},
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics",
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics"
}