2024 Workshop on Breton Language Technologies
The CNRS laboratory IKER and the Université de Bretagne Occidentale (UBO) are organizing the '''2024 Workshop on Breton Language Technologies''', which will take place at the University of Quimper (Brittany) on June 8, at the [https://www.google.com/maps/place/P%C3%B4le+universitaire+Pierre-Jakez+H%C3%A9lias+-+Universit%C3%A9+de+Bretagne+Occidentale+(UBO)/@47.972174,-4.0960801,16z/data=!4m14!1m7!3m6!1s0x4810d5ca3d1dde93:0xe62a6150e5d5af5d!2sP%C3%B4le+universitaire+Pierre-Jakez+H%C3%A9lias+-+Universit%C3%A9+de+Bretagne+Occidentale+(UBO)!8m2!3d47.9721704!4d-4.0935052!16s%2Fg%2F1hhmw77z8!3m5!1s0x4810d5ca3d1dde93:0xe62a6150e5d5af5d!8m2!3d47.9721704!4d-4.0935052!16s%2Fg%2F1hhmw77z8?entry=ttu Pôle Jakez Helias], in the room called « salle du conseil ».






The aim of this workshop is to facilitate a meeting of minds between linguists and developers of technologies for Breton and Brittonic languages. Our objective is to foster a deeper understanding of each other's achievements and to build our collective capacity in this field.
The date has been chosen to fall between the [http://celticstudents.blogspot.com/p/conference.html Celtic Student Conference] in Brest from May 30 to June 1, and the CRBC [https://nouveau.univ-brest.fr/breton-summer-school-crbc/fr/donnees-personnelles Breton Summer School] in English beginning on June 10 in Quimper.
It will be possible to follow the event online. English is the preferred language of exchange, but questions may of course be raised in Breton or French and will be translated on the spot. Sessions run from 9h30 to 12h30 and from 14h to 16h30 (CEST).
 
 
To register, please contact [https://iker.cnrs.fr/melanie-jouitteau/ Mélanie Jouitteau] before May 31 if you wish to have lunch with us on site, or before June 7 if you wish to be sent the link to follow the talks online.
 
 
== Who was there? ==
 
Confirmed speakers and attendees on site:
: [https://nouveau.univ-brest.fr/hcti/fr/membre/liana-ermakova Liana Ermakova] (UBO), [https://loicgrobol.github.io/ Loic Grobol] (U. Paris Nanterre), [https://www.linkedin.com/in/johannes-heinecke-571a614/ Johannes Heinecke] (Orange), [https://iker.cnrs.fr/melanie-jouitteau/ Mélanie Jouitteau] (IKER, CNRS), Gweltaz Duval-Guennoc (independent), [https://www.linkedin.com/in/aentem/ Alan Entem] (independent), Reun Bideault (independent), [https://lacito.cnrs.fr/en/directory/tanguy-solliec/ Tanguy Solliec] (LACITO, CNRS), [https://perso.univ-rennes2.fr/myriam.guillevic Myriam Guillevic] (Celtic-BLM, U. Rennes II), David Lesvenan (Breizh Niverel), David Le Meur (Breizh Niverel), Milan Rezac (IKER, CNRS), [https://www.bangor.ac.uk/research-students/scse/leena-farhat-530877/ Leena Farhat] (U. Bangor), [https://research.bangor.ac.uk/portal/en/researchers/preben-vangberg(4f4fd74a-bf06-4747-9632-873e6732c09a).html Preben Vangsberg] (U. Bangor), Michel Mermet (U. Rennes II), Alan Kersaudy (independent, OPAB), Mariia Kolesnichenko (U. Rennes II), Xavier Marjou (Orange), Juluan ar Hoz ([https://tiarvro-gwengamp.bzh/annuaire/joomlannuaire/fiche/56-apprendre-le-breton-a-begard-avec-hent-don.html Hent Don], Bear), Gwenn Meynier ([https://drouizig.org/en/ An Drouizig]), Mevena Guillouzic-Gouret (cinémathèque de Bretagne), Damien Quiguer, Antoine Jamelot, [https://perso.univ-rennes2.fr/en/stefan.moal Stefan Moal] (Celtic-BLM, U. Rennes II), [http://www.lmba-math.fr/annuaires/fiche.html?nom=rannou Eric Rannou] (Laboratoire de Mathématiques de Bretagne Atlantique, UBO).
 
Confirmed on-line attendance:
: [https://www.lancaster.ac.uk/staff/ezzini/ Saad Ezzini] (U. Lancaster), [https://www.lancaster.ac.uk/staff/elhaj/index.html Mo El-haj] (U. Lancaster), [https://www.linkedin.com/in/nicolas-vigneron-3420ba1bb/?originalSubdomain=fr Nicolas Vigneron] (Wikimedia), [https://orcid.org/0000-0002-5723-9819 Natasha Romanova] (U. Caen), [http://www.irisa.fr/prive/foret/ Annie Foret] (IRISA, U. Rennes I), Madeleine Adkins (independent researcher), Philippe Argouarc'h (ABP), Anthony Vitt, [http://sambigeard.com/ Sam Bigeard] (INRIA, COLAF), [https://jfmondon.wixsite.com/breizh/publications Jean-Francois Mondon] (U. Muskingum, Ohio), [https://pure.qub.ac.uk/en/persons/merryn-davies-deacon-2 Merryn Davies-Deacon] (Queen's University Belfast), [https://github.com/OrianeN Oriane Nédey] (IR, INRIA), Seongwoo Kang (UBO), Jeanne Mégly (independent, Dastum).
 
= Program =
 
== 8h45 - ''Welcome coffee and pastries'' ==
 
== 9h15 - ''Introduction'' ==
 
The morning session will be chaired by Alice Millour (U. Paris 8), and the afternoon session by Mélanie Jouitteau (IKER, CNRS). Each presentation lasts 30 minutes, followed by 10 minutes of questions. Reun Bideault will be our timekeeper, giving hand signals to the speaker (5, 3 and 1 minutes remaining, then STOP).
 
=== 9h30-10h10 - Breton annotated corpora, Autogramm report ===


'''Mélanie Jouitteau''' (IKER, CNRS), for the [https://autogramm.github.io/ ANR-funded Autogramm team] led by Sylvain Kahane (Modyco, CNRS, Paris) with, for Breton, Bruno Guillaume (LORIA, INRIA), Kim Gerdes (LISN, CNRS) and Loic Grobol (Modyco, CNRS and Université Paris Nanterre), and the [[TAL]] master projects of Salomé Chandora, Katharine Jiang, Aurélien Said Housseini (2022-2023), Yingzi Liu and Yidi Huang (2023-2024).
: 30 min + 10 min questions, [[Jouitteau & al. (2024a)|abstract in French]]


I report on a two-year project for [[Breton treebank II]], a group project aiming to build an annotated ''Universal Dependencies'' (UD) corpus ([[De Marneffe & al. (2021)|De Marneffe & al. 2021]], [[Nivre & al. (2020)|Nivre & al. 2020]]), based on data extracted from the ARBRES wikigrammar ([[Jouitteau (2009-)|Jouitteau 2009-]]). The work consists of extracting data from the ARBRES wikigrammar and organizing them in the [https://universaldependencies.org/docs/format.html CoNLL-U format], a machine-readable format suited to building a richly annotated corpus. The CoNLL-U files are then enriched with dependency annotations. The annotation is done in the SUD format, with automatic conversion into the UD format. The extraction is accessible [https://arboratorgrew.elizia.net/?#/projects/keuneud_breton here on github], and the enrichment on [https://arboratorgrew.elizia.net/?#/projects/keuneud_breton Arborator].
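As a minimal sketch of what the CoNLL-U layout looks like to a program, the following Python snippet reads one sentence in the ten-column, tab-separated format and lists each token with its head and dependency relation. The sample is a toy English sentence chosen only to show the layout, not data from the ARBRES extraction, and <code>parse_conllu</code> is a hypothetical helper rather than part of any Autogramm tool.

```python
from typing import NamedTuple


class Token(NamedTuple):
    idx: int      # ID column (1-based position in the sentence)
    form: str     # FORM column
    lemma: str    # LEMMA column
    upos: str     # UPOS column
    head: int     # HEAD column (0 means the token is the root)
    deprel: str   # DEPREL column


# Toy sentence, for illustration only (not taken from the ARBRES data).
SAMPLE = (
    "# text = The dog barks\n"
    "1\tThe\tthe\tDET\t_\t_\t2\tdet\t_\t_\n"
    "2\tdog\tdog\tNOUN\t_\t_\t3\tnsubj\t_\t_\n"
    "3\tbarks\tbark\tVERB\t_\t_\t0\troot\t_\t_\n"
)


def parse_conllu(block: str) -> list[Token]:
    """Parse one CoNLL-U sentence block into a list of tokens."""
    tokens = []
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        if not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        tokens.append(
            Token(int(cols[0]), cols[1], cols[2], cols[3], int(cols[6]), cols[7])
        )
    return tokens


tokens = parse_conllu(SAMPLE)
for tok in tokens:
    head_form = "ROOT" if tok.head == 0 else tokens[tok.head - 1].form
    print(f"{tok.form} --{tok.deprel}--> {head_form}")
```

The HEAD and DEPREL columns are the ones filled in during the dependency annotation step; the other columns can already be populated at extraction time.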


The first extraction in 2022, 'Kenstur', produced a small aligned corpus of high linguistic diversity that has been used to train two Breton<->French translation systems ([[Grobol (2022-)|Grobol 2022-]], [[OPLB & al. (2022)|OPLB & al. 2022]]). Early feedback on these training runs suggests that this type of high-diversity corpus improves results when training on small resource sets ([[Grobol & Jouitteau (2024a)]], Entem p.c.). The 2024 extraction, 'Keneud', is somewhat bigger, organized by dialect, and includes some gloss annotations. A parser trained on a corrected version of [[Tyers & Ravishankar (2018)]] has pre-annotated the dependencies, adapted so that it assigns dominance to the [[rannig]] of each sentence in SUD. We also add annotation of consonant mutations.


* [[Grobol & Jouitteau (2024a)|Grobol, Loïc, et Mélanie Jouitteau. 2024a]]. 'ARBRES Kenstur: A Breton-French Parallel Corpus Rooted in Field Linguistics', ''Proceedings of the Fourteenth Language Resources and Evaluation Conference'', European Language Resource Association (ELRA).
* [[Tyers & Ravishankar (2018)|Tyers, Francis M. & Vinit Ravishankar. 2018]]. 'A prototype dependency treebank for Breton', ''Actes de la conférence Traitement Automatique de la Langue Naturelle'', TALN 2018, 197-204. [http://talnarchives.atala.org/TALN/TALN-2018/25.pdf texte]. [https://github.com/UniversalDependencies/UD_Breton-KEB 2023 corrected version on github].
 
=== 10h10-10h50 - Translation: state of the art and going forward ===


'''Loic Grobol''' (Modyco, U. Paris Nanterre), with Sarah Almeida Barreto (U. Sorbonne Nouvelle) and Mélanie Jouitteau (IKER, CNRS)
: 30 min + 10 min questions, [[Grobol, Almeida Baretto & Jouitteau (2024)|abstract in French]]


The first machine translation system for Breton ([[Tyers (2009)|Tyers, 2009]]) and the associated parallel corpus turn 15 this year. While its performance was modest, it already showed that such a system was possible and could at least partially help non-speakers understand Breton. Since then, a few studies proposing improvements have been published ([[Sánchez-Cartagena & al. (2015)|Sánchez-Cartagena & al. 2015]], [[Sánchez-Cartagena & al. (2020)|2020]]), but no usable software or resources have been made available. For fifteen years, Breton thus did not really benefit from the major advances in machine translation. [[Grobol & Jouitteau (2024a)]] then published a new parallel corpus extracted from the ARBRES wikigrammar ([[Jouitteau (2009-)|Jouitteau, 2009-2024]]) and a modern translation system with significantly improved performance. Models with undocumented training and opaque resources are off-topic here, as they do not feed into the advances of future models. Breton is also among the languages that certain multilingual translators (GPT3.5, Baidu, etc.) announce as properly supported, but these mainly take advantage of the lack of robust evaluation material for Breton, and of the lack of a balance of power strong enough to impose such evaluation ([[Jouitteau & Grobol (2024a)|Jouitteau & Grobol 2024a]]). As it stands, for developers who do not steal their data from speaker communities, performance remains well below that of translators for high-resource languages, and Breton parallel corpora remain scattered, poorly documented, and of uncertain quality.


This presentation reports on ongoing work from Sarah Almeida Barreto's Master's thesis (Sorbonne Nouvelle), supervised by Loic Grobol (U. Paris Nanterre), in consultation with Mélanie Jouitteau (IKER, CNRS). We present a comprehensive inventory of existing parallel corpora, which we subject to strict evaluation in order to build as complete a corpus as possible, itself systematically assessed to ensure its quality. These resources are made available online in downloadable packages, and listed [https://entrelangues.modyco.fr/index.php/Breton#Ressources_num%C3%A9riques on the Entrelangues website] where their metadata can be discussed by speakers. We hope to be able to present the results of a first training run in June. This work will enable everyone to develop new, higher-quality translation systems, to design evaluation datasets that can serve as standards in the future, and to clearly identify the resource needs for translation into and from Breton in order to guide future data collection and curation work.


* [[Grobol & Jouitteau (2024a)|Grobol, Loïc, et Mélanie Jouitteau. 2024a]]. 'ARBRES Kenstur: A Breton-French Parallel Corpus Rooted in Field Linguistics', ''Proceedings of the Fourteenth Language Resources and Evaluation Conference'', European Language Resource Association (ELRA).
* [[Jouitteau (2009-)|Jouitteau, Mélanie. 2009–2024]]. « ARBRES, wikigrammaire des dialectes du breton et centre de ressources pour son étude linguistique formelle ». 2009–2024. http://arbres.iker.cnrs.fr.
* [[Jouitteau & Grobol (2024a)|Jouitteau, Mélanie & Loic Grobol. 2024a]]. 'Petits oublis, grands effets : le silençage des communautés linguistiques minorisées dans le TAL et ses conséquences', Karën Fort, Aurélie Névéol (éds.), ''Ethics and NLP: 10 years after'', Journée d’études ATALA ''éthique et TAL : 10 ans après'', 2024. [https://inria.hal.science/hal-04533870/document#page=37 hal-04533870].
* Sánchez-Cartagena, Víctor M., Mikel L. Forcada, et Felipe Sánchez-Martínez. 2020. « A multi-source approach for Breton–French hybrid machine translation ». In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, 61‑70. Lisboa, Portugal: European Association for Machine Translation. https://aclanthology.org/2020.eamt-1.8.
* Sánchez-Cartagena, Víctor M., Juan Antonio Pérez-Ortiz, et Felipe Sánchez-Martínez. 2015. « A Generalised Alignment Template Formalism and Its Application to the Inference of Shallow-Transfer Machine Translation Rules from Scarce Bilingual Corpora ». Computer Speech & Language, Hybrid Machine Translation: integration of linguistics and statistics, 32 (1): 46‑90. https://doi.org/10.1016/j.csl.2014.10.003.
* [[Tyers (2009)|Tyers, Francis M. 2009]]. « Rule-Based Augmentation of Training Data in Breton-French Statistical Machine Translation ». In Proceedings of the 13th Annual conference of the European Association for Machine Translation. European Association for Machine Translation. https://aclanthology.org/2009.eamt-1.29.


== 10h50-11h10 - ''coffee break'' ==


=== 11h10-11h50 - Evaluating translations ===


'''Liana Ermakova''', HCTI (UBO), '''Myriam Guillevic''' (U. Rennes II) and '''Mélanie Jouitteau''' (IKER, CNRS)
: 30 min + 10 min questions, [[Ermakova & al. (2024)|abstract in French]]


Liana Ermakova will present automatic evaluations of a range of translation models on Breton-to-French and French-to-Breton translation.
Jouitteau and Guillevic will present a qualitative evaluation set for translations, covering a wide spectrum of Breton varieties. This set will be distributed once feedback from the audience has been taken into account.


=== 11h50 - 12h30 - Multilingual Dependency Parsing for Celtic languages and their neighbouring languages ===
 


'''Johannes Heinecke''', Orange
: 30 min + 10 min questions, [[Heinecke (2024)|abstract in French]]


Dependency parsing is a typical NLP task which takes plain sentences as input and generates dependency syntax trees as output. Currently, we deploy dependency parsing in a tool chain for preprocessing customer and employee comments on products and services, in order to classify them thematically. POS tagging and dependency parsing are used to easily identify "who did what" and to create nominal groups as keywords (instead of single words). In the past, handcrafted rules and lexicons were written to make a parser work. Later, statistical approaches proved far more efficient, both for transition-based and graph-based parsers. Recently, notably since the advent of word embeddings (like Word2Vec) and later of context-aware word embeddings such as those obtained from language models like BERT, graph-based parsers have proved to be even better.
All statistical approaches to dependency parsing need training data. The Universal Dependencies (UD) project provides the needed data in the form of 150 treebanks in over a hundred languages. Even though some treebanks are very small (for instance the Breton treebank [[Breton KEB]]), others are rich. Where there is little or no treebank data, transfer learning from similar languages can be successful, notably with the UD data, since it has been annotated using a single set of guidelines for all languages. For instance, the sets of possible part-of-speech tags, dependency relations and morpho-syntactic features are defined universally.
Most treebanks are monolingual, provided that expressions from other languages that can occur in the data, such as film titles or geographic names, are not counted as making them bi- or multilingual. In the real world, especially for speakers of Celtic languages, code switching is everywhere. We present a multilingual dependency parsing model (graph-based parser) which can parse any mixture of Welsh, Irish, Scottish Gaelic or Manx with English or French without losing much quality with respect to a monolingual model.
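The abstract does not say how parsing quality is measured. A common choice for UD parsers is the pair of attachment scores UAS (share of tokens with the correct head) and LAS (share of tokens with both the correct head and the correct relation label). As a minimal sketch under that assumption, with toy gold and predicted analyses that are purely illustrative and a hypothetical helper name:

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) for two aligned lists of (head, deprel) pairs."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted analyses must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    # UAS: the head index matches, whatever the label.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS: both the head index and the relation label match.
    las_hits = sum(g == p for g, p in zip(gold, pred))
    return uas_hits / len(gold), las_hits / len(gold)


# Toy three-token sentence: the parser finds every head but mislabels one relation.
gold = [(2, "det"), (3, "nsubj"), (0, "root")]
pred = [(2, "det"), (3, "obj"), (0, "root")]

uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.2f}, LAS = {las:.2f}")
```

Comparing such scores for a multilingual model and a monolingual one on the same test sentences is one way to quantify how much quality is lost, if any.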
== 12h30-14h - ''lunch break'' ==
Lunch will be shared on site (plenty of vegetarian options).
=== 14h00-14h40 - An Overview of Breton Audio Material on Cocoon ===
'''Tanguy Solliec''', LACITO CNRS
: 30 min + 10 min questions, [[Solliec (2024)|abstract in French]]
Cocoon: A Repository for Speech Recordings; An Overview of Breton Language Material
[https://cocoon.huma-num.fr/exist/crdo/ Cocoon], short for COllections de COrpus Oraux Numériques, is a digital repository that supports researchers in building oral corpora and in archiving audio material collected during their research activities. This resource is developed by the CNRS as part of the HUMA-NUM digital ecosystem. It contributes to the [[open science]] and open data movements and, more broadly, to the preservation of some aspects of intangible heritage.
The audio and video files on the Cocoon web platform are organized into thematic collections. Several of these are dedicated to Breton and were produced during dialectological fieldwork. Although other collections of Breton language recordings are available on other platforms, the Cocoon material comes with systematic OLAC metadata.
Even though the metadata associated with this material is generally detailed, little documentation or transcription is attached to these recordings, for a variety of reasons. The Cocoon repository provides raw material containing Breton data to various degrees. These research files are therefore heterogeneous, and a typology has to be developed to better describe their content, with a view to reuse for other purposes. In order to evaluate the material available and to identify the files best suited to further tasks, different criteria can be taken into account:
: -main language used in the recording (French with Breton words, interview conducted in Breton…)
: -quality of the recording
: -content of recordings
: -duration, possibility to cut into shorter chunks
: -sociolinguistic context and fluency of the speakers
: -number of speakers involved
: -presence of annotations and/or transcriptions
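The criteria above could be recorded per file and combined into a rough ranking. The following Python sketch is only an illustration: the field names, the identifiers, the values and the scoring rule are all assumptions made for the example, not part of Cocoon or of its OLAC metadata.

```python
from dataclasses import dataclass


@dataclass
class Recording:
    identifier: str          # hypothetical label, not a real Cocoon identifier
    breton_share: float      # estimated fraction of speech in Breton (0.0 to 1.0)
    audio_quality: int       # subjective rating from 1 (poor) to 5 (good)
    duration_min: float      # duration in minutes
    n_speakers: int          # number of speakers involved
    has_transcription: bool  # whether annotations or transcriptions exist


def suitability(rec: Recording) -> float:
    """Toy score favouring Breton-dominant, clean, transcribed recordings."""
    score = rec.breton_share * rec.audio_quality
    if rec.has_transcription:
        score += 1.0
    return score


# Invented example values, for illustration only.
recordings = [
    Recording("rec-A", 0.9, 4, 42.0, 2, False),
    Recording("rec-B", 0.2, 5, 15.0, 1, True),
    Recording("rec-C", 0.7, 2, 60.0, 3, False),
]

ranked = sorted(recordings, key=suitability, reverse=True)
for rec in ranked:
    print(f"{rec.identifier}: {suitability(rec):.1f}")
```

Any real typology would need weights agreed on by the people who know the recordings; the point here is only that explicit, per-file criteria make the ranking reproducible.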
The Cocoon repository focuses primarily on conserving the material it stores and giving access to it. However, the platform does not allow this data to be enriched with additional content afterwards. In this context, how could this “bad data” contribute to the development of language technology resources for a (relatively) low-resource language such as Breton? Initiatives like the ongoing research project [https://pages.llf-paris.fr/~deeptypo/ DeepTypo] (LLF, Paris), which aims to provide automatic transcriptions and extract meaningful information from small corpora, offer interesting insights.
In the case of the Cocoon material, the very first step is to link the published documents and the available digitized information to the recordings. The ''[[NALBB|Nouvel Atlas Linguistique de la Basse Bretagne]]'' ([[Le Dû (2001)|Le Dû 2001]]) is a good illustration of possible steps forward. Given the uneven content of the recordings, a second line of work is to build a methodology to detect to what extent Breton is used in the recordings and to rank them accordingly. Various approaches provide Speech-to-Text solutions for automatic transcription. Testing them on fieldwork recordings will give a glimpse of their potential. It will also be an opportunity to investigate whether and how this raw data can contribute to their improvement.
* [[Le Dû (2001)|Le Dû, Jean. 2001]]. Nouvel atlas linguistique de la Basse-Bretagne. 2 volumes. 601 maps, Centre de Recherche Bretonne et Celtique, Université de Bretagne Occidentale, Brest.
=== 14h40-15h10 - Anaouder, Developing ASR tools for Breton ===
'''Gweltaz Duval-Guennoc''', independent developer
: 30 min + 10 min questions, [[Duval-Guennoc (2024)|abstract in French]]
Automatic Speech Recognition (ASR) systems can be invaluable resources for the Breton community, benefiting both learners and proficient speakers. Technologies like SMS dictation on smartphones or automatic captioning could potentially enhance exposure to the language by incorporating it into everyday handheld devices and equipping content creators with better tools.
Several notable initiatives to develop Speech-To-Text models for Breton have been made ([https://huggingface.co/alanoix/whisper-small-br Alan Entem], [https://huggingface.co/BlueRaccoon/whisper-medium-br Holly Montalvo] ''BlueRacoonTech'', [https://huggingface.co/Marxav/wav2vec2-large-xlsr-53-breton Xavier Marjou]), but we are still far from having dependable and user-friendly software for end-users. One such initiative is [[Duval-Guennoc (2022-)|Anaouder]], which focuses particularly on creating on-device solutions to prioritize user privacy and autonomy.
We have been training models using the Kaldi framework with approximately 60 hours of transcribed audio data. These models are integrated into a Python module that offers command-line tools for real-time and continuous inference from a microphone or audio files, using [https://alphacephei.com/vosk Vosk] as a backend. The code and models are available under an MIT open-source license on [https://github.com/gweltou/vosk-br GitHub] and [https://pypi.org/project/anaouder/ PyPI].
In addition, we have developed a lightweight rule-based NLP [https://github.com/gweltou/ostilhou toolkit] to streamline textual and audio data processing. This toolkit includes features such as sentence segmentation, pre-tokenization, phonetization, text normalization, and inverse-normalization. Rule-based NLP tasks remain highly relevant for low-resource languages like Breton.
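To give an idea of what rule-based normalization and inverse-normalization involve, here is a minimal Python sketch that spells out small numbers as Breton words and converts them back to digits. It does not use the toolkit's API: the function names are hypothetical, only the numbers 1 to 10 are covered, and the table uses masculine forms only, ignoring gender agreement and mutations.

```python
import re

# Breton numerals 1-10, masculine forms only (an illustrative simplification).
DIGIT_TO_WORD = {
    "1": "unan", "2": "daou", "3": "tri", "4": "pevar", "5": "pemp",
    "6": "c'hwec'h", "7": "seizh", "8": "eizh", "9": "nav", "10": "dek",
}
WORD_TO_DIGIT = {word: digit for digit, word in DIGIT_TO_WORD.items()}

# Match a numeral word only when it is not glued to other letters or apostrophes.
_WORD_PATTERN = re.compile(
    r"(?<![\w'])("
    + "|".join(re.escape(w) for w in sorted(WORD_TO_DIGIT, key=len, reverse=True))
    + r")(?![\w'])"
)


def normalize(text: str) -> str:
    """Spell out known digit sequences as words (text form suited to ASR training)."""
    return re.sub(
        r"\b\d+\b",
        lambda m: DIGIT_TO_WORD.get(m.group(), m.group()),
        text,
    )


def inverse_normalize(text: str) -> str:
    """Turn numeral words back into digits (display form for transcripts)."""
    # Note: a real system must handle ambiguity, e.g. "unan" is not always a number.
    return _WORD_PATTERN.sub(lambda m: WORD_TO_DIGIT[m.group(1)], text)


print(normalize("3 levr"))            # digits to words
print(inverse_normalize("tri levr"))  # words to digits
```

Numbers outside the table are left untouched, which keeps the sketch safe but also shows why a complete rule set for Breton numerals is a substantial piece of work.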
Lastly, we present a desktop application in its early stages of development. This application aims to simplify the creation of transcripts in Breton while still being versatile enough to assist in data alignment and verification for the development of Breton text-audio corpora.
== 15h10-15h30 - ''coffee break'' ==
=== 15h30-16h30 - Round table and discussion - Financing opportunities for research and development ===
: 30 min flash presentations of the different options, and 30 min fruitful discussion
Round table on funding possibilities in the UK (EPSRC) and France (ANR, CIFRE), with the participation of local actors such as the Public Office of the Breton Language (OPAB/Rannvro, pending acceptance of the invitation) and the endowment fund [https://bretagnenumerique.bzh/ Breizh Niverel] (David Lesvenan & David Le Meur, confirmed).

Current version as of 25 October 2024, 06:35

The CNRS laboratory IKER and the Université de Bretagne Occidentale (UBO) are organizing the 2024 Workshop on Breton Language Technologies, which will take place at the University of Quimper (Brittany) on June 8, at the Pôle Jakez Helias, in the room called « salle du conseil ».


Contacts:

Mélanie Jouitteau: melanie.jouitteau at iker.cnrs.fr
Milan Rezac: milan.rezac at iker.cnrs.fr


The aim of this workshop is to facilitate a meeting of minds between linguists and developers of technologies for Breton and Brittonic languages. Our objective is to foster a deeper understanding of each other's achievements and to build our collective capacity in this field. The date has been chosen to fall between the Celtic Student Conference in Brest from May 30 to June 1, and the CRBC Breton Summer School in English beginning on June 10 in Quimper. It will be possible to follow the event online. The preferred language of exchange is English, but questions can of course be raised in either Breton or French and will be translated on the spot. The sessions will run from 9h30 to 12h30, and from 14h to 16h30 (CEST).

To register, please contact Mélanie Jouitteau if you wish to have lunch with us on site (before May 31), or to be sent the link to follow the talks online (before June 7).


Who was there?

Confirmed speakers and attendees on site:

Liana Ermakova (UBO), Loic Grobol (U. Paris Nanterre), Johannes Heinecke (Orange), Mélanie Jouitteau (IKER, CNRS), Gweltaz Duval-Guennoc (independent), Alan Entem (independent), Reun Bideault (independent), Tanguy Solliec (LACITO, CNRS), Myriam Guillevic (Celtic-BLM, U. Rennes II), David Lesvenan (Breizh Niverel), David Le Meur (Breizh Niverel), Milan Rezac (IKER, CNRS), Leena Farhat (U. Bangor), Preben Vangsberg (U. Bangor), Michel Mermet (U. Rennes II), Alan Kersaudy (independent, OPAB), Mariia Kolesnichenko (U. Rennes II), Xavier Marjou (Orange), Juluan ar Hoz (Hent Don, Bear), Gwenn Meynier (An Drouizig), Mevena Guillouzic-Gouret (cinémathèque de Bretagne), Damien Quiguer, Antoine Jamelot, Stefan Moal (Celtic-BLM, U. Rennes II), Eric Rannou (Laboratoire de Mathématiques de Bretagne Atlantique, UBO).

Confirmed on-line attendance:

Saad Ezzini (U. Lancaster), Mo El-haj (U. Lancaster), Nicolas Vigneron (Wikimedia), Natasha Romanova (U. Caen), Annie Foret (IRISA, U. Rennes I), Madeleine Adkins (independent researcher), Philippe Argouarc'h (ABP), Anthony Vitt, Sam Bigeard (INRIA, COLAF), Jean-Francois Mondon (U. Muskingum, Ohio), Merryn Davies-Deacon (Queen's University of Belfast), Oriane Nédey (IR, INRIA), Seongwoo Kang (UBO), Jeanne Mégly (independent, Dastum).

Program

8h45 - Welcome coffee and pastries

9h15 - Introduction

The morning session will be chaired by Alice Millour (U. Paris 8), and the afternoon session by Mélanie Jouitteau (IKER, CNRS). Each presentation is 30 min, followed by 10 min of questions. Reun Bideault will be our timekeeper, giving hand signals to the speaker (-5 min, -3 min, -1 min and STOP).

9h30-10h10 - Breton annotated corpora, Autogramm report

Mélanie Jouitteau (IKER, CNRS), for the ANR-funded team Autogramm led by Sylvain Kahane (Modyco, CNRS, Paris) with, for Breton, Bruno Guillaume (LORIA, INRIA), Kim Gerdes (LISN, CNRS) and Loic Grobol (Modyco, CNRS and Université Paris Nanterre), and the TAL master projects of Salomé Chandora, Katharine Jiang, Aurélien Said Housseini (2022-2023), Yingzi Liu and Yidi Huang (2023-2024).

30 min + 10 min questions, summary in French

I report on a two-year project, Breton treebank II, a group project aiming at building an annotated Universal Dependencies (UD) corpus (De Marneffe & al. 2021, Nivre & al. 2020), based on data extracted from the ARBRES wikigrammar (Jouitteau 2009-). The work consists in extracting data from the ARBRES wikigrammar and organizing them in the CoNLL-U format, which can serve as the basis for a richly annotated corpus. The CoNLL-U data is then enriched with dependency annotations. The coding is in the SUD format, with automatic conversion into the UD format. The extraction is accessible on GitHub, and the enrichment on Arborator.

The first extraction (2022), 'Kenstur', yielded a small aligned corpus of high linguistic diversity that has been used to train two AI models for Breton<->French translation (Grobol 2022-, OPLB & al. 2022). The first feedback on these trainings suggests that this type of high-diversity corpus improves results when training on small resource sets (Grobol & Jouitteau (2024a), Entem p.c.). The 2024 extraction, 'Keneud', is somewhat bigger, organized by dialects, and includes some gloss annotations. A parser trained on a corrected version of Tyers & Ravishankar (2018) has pre-annotated the dependencies, with an adaptation so that it assigns dominance to the rannig of each sentence in SUD. We add coding of consonant mutations.
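
As a rough illustration of the CoNLL-U format mentioned above, the following minimal Python sketch reads one sentence of ten tab-separated columns and retrieves the root of its dependency tree. The sample sentence, its UD-style annotation, and the names `Token`, `parse_conllu_sentence` and `find_root` are assumptions made for illustration; they are not taken from the Autogramm extraction or from Arborator.

```python
from dataclasses import dataclass

# The ten CoNLL-U columns, in the order defined by Universal Dependencies.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


@dataclass
class Token:
    id: int
    form: str
    lemma: str
    upos: str
    head: int      # 0 means "root of the sentence"
    deprel: str


def parse_conllu_sentence(block: str) -> list[Token]:
    """Parse one CoNLL-U sentence block into a list of tokens."""
    tokens = []
    for line in block.strip().splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines such as "# text = ..."
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        row = dict(zip(FIELDS, cols))
        if not row["id"].isdigit():
            continue  # skip multiword ranges (e.g. 1-2) and empty nodes (e.g. 1.1)
        tokens.append(Token(int(row["id"]), row["form"], row["lemma"],
                            row["upos"], int(row["head"]), row["deprel"]))
    return tokens


def find_root(tokens: list[Token]) -> Token:
    """Return the token whose head is 0."""
    return next(t for t in tokens if t.head == 0)


# Hypothetical sample: the annotation is illustrative, not from the treebank.
SAMPLE = "\n".join([
    "# text = Me a zo skuizh.",
    "\t".join(["1", "Me", "me", "PRON", "_", "_", "4", "nsubj", "_", "_"]),
    "\t".join(["2", "a", "a", "AUX", "_", "_", "4", "aux", "_", "_"]),
    "\t".join(["3", "zo", "bezañ", "AUX", "_", "_", "4", "cop", "_", "_"]),
    "\t".join(["4", "skuizh", "skuizh", "ADJ", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["5", ".", ".", "PUNCT", "_", "_", "4", "punct", "_", "_"]),
])

if __name__ == "__main__":
    sentence = parse_conllu_sentence(SAMPLE)
    print(find_root(sentence).form)
```

Only six of the ten columns are kept here to keep the sketch short; a real pipeline would also need FEATS and MISC, which is where glosses or mutation codes could plausibly be stored.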

10h10-10h50 - Translation: state of the art and going forward

Loic Grobol (Modyco, U. Paris Nanterre), with Sarah Almeida Barreto (U. Sorbonne Nouvelle) and Mélanie Jouitteau (IKER, CNRS)

30 min + 10 min questions, summary in French

The first automatic translator for Breton (Tyers, 2009) and the associated parallel corpus will be 15 years old this year. While its performance was modest, it already showed that such a system was possible and could at least partially help non-speakers understand Breton. Since then, a few works proposing improvements have been published (Sánchez-Cartagena & al. 2015, 2020), but no usable software or resources have been made available. For fifteen years, Breton did not really benefit from the major advances in machine translation. Grobol & Jouitteau (2024a) then published a new parallel corpus extracted from the ARBRES wikigrammar (Jouitteau, 2009-2024) and a modern translation system with significantly improved performance. Models with undocumented training and opaque resources are off-topic here, as they do not feed into the advances of future models. Breton is also one of the languages that certain multilingual translators (GPT3.5, Baidu, etc.) announce as supported with good quality, but these claims mainly take advantage of the lack of robust evaluation material for Breton, and of the lack of sufficient leverage to impose such evaluation (Jouitteau & Grobol 2024a). As it stands, for developers who don't steal their data from speaking communities, performance remains well below that of translators for high-resource languages, and Breton parallel corpora remain scattered, poorly documented, and of uncertain quality.

This presentation reports on the ongoing work of Sarah Almeida Barreto's Master's thesis (Sorbonne Nouvelle), directed by Loic Grobol (U. Paris Nanterre), in consultation with Mélanie Jouitteau (IKER, CNRS). We present a comprehensive inventory of existing parallel corpora. Each corpus is subjected to strict evaluation in order to build up as complete a combined corpus as possible, which is in turn assessed systematically to ensure its quality. These resources are made available online in downloadable packages, and listed on the Entrelangues website where their metadata can be discussed by speakers. We hope to be able to present the results of a first training in June. This work will enable us all to develop new, higher-quality translation systems, to design evaluation datasets that can be used as standards in the future, and also to clearly identify the resource requirements for translation into and from Breton in order to guide future data collection and curation work.

  • Grobol, Loïc, and Mélanie Jouitteau. 2024a. 'ARBRES Kenstur: A Breton-French Parallel Corpus Rooted in Field Linguistics', Proceedings of the Fourteenth Language Resources and Evaluation Conference, European Language Resource Association (ELRA).
  • Jouitteau, Mélanie. 2009–2024. « ARBRES, wikigrammaire des dialectes du breton et centre de ressources pour son étude linguistique formelle ». 2009–2024. http://arbres.iker.cnrs.fr.
  • Jouitteau, Mélanie & Loic Grobol. 2024a. 'Petits oublis, grands effets : le silençage des communautés linguistiques minorisées dans le TAL et ses conséquences', Karën Fort, Aurélie Névéol (eds.), Ethics and NLP: 10 years after, Journée d’études ATALA éthique et TAL : 10 ans après, 2024. hal-04533870.
  • Sánchez-Cartagena, Víctor M., Mikel L. Forcada, and Felipe Sánchez-Martínez. 2020. « A multi-source approach for Breton–French hybrid machine translation ». In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, 61‑70. Lisboa, Portugal: European Association for Machine Translation. https://aclanthology.org/2020.eamt-1.8.
  • Sánchez-Cartagena, Víctor M., Juan Antonio Pérez-Ortiz, and Felipe Sánchez-Martínez. 2015. « A Generalised Alignment Template Formalism and Its Application to the Inference of Shallow-Transfer Machine Translation Rules from Scarce Bilingual Corpora ». Computer Speech & Language, Hybrid Machine Translation: integration of linguistics and statistics, 32 (1): 46‑90. https://doi.org/10.1016/j.csl.2014.10.003.
  • Tyers, Francis. 2010. « Rule-based Breton to French machine translation ». In Proceedings of the 14th Annual Conference of the European Association for Machine Translation. Saint Raphaël, France: European Association for Machine Translation. https://aclanthology.org/2010.eamt-1.13.
  • Tyers, Francis M. 2009. « Rule-Based Augmentation of Training Data in Breton-French Statistical Machine Translation ». In Proceedings of the 13th Annual conference of the European Association for Machine Translation. European Association for Machine Translation. https://aclanthology.org/2009.eamt-1.29.

10h50-11h10 - coffee break

11h10-11h50 - Evaluating translations

Liana Ermakova, HCTI (UBO), Myriam Guillevic (U. Rennes II) and Mélanie Jouitteau (IKER, CNRS)

30 min + 10 min questions, summary in French

Liana Ermakova will present automatic evaluations of a range of translation models for Breton-to-French and French-to-Breton translation. Jouitteau and Guillevic will present a qualitative evaluation set for translations, covering a wide spectrum of Breton varieties. This set will be distributed once the feedback of the audience has been taken into account.

11h50-12h30 - Multilingual Dependency Parsing for Celtic languages and their neighbouring languages

Johannes Heinecke, Orange

30 min + 10 min questions, summary in French

Dependency parsing is a typical NLP task which takes plain sentences as input and generates dependency syntax trees as output. Currently, we deploy dependency parsing in a tool chain for preprocessing customer and employee comments on products and services, in order to classify them thematically. POS tagging and dependency parsing are used to easily identify "who did what" and to create nominal groups as keywords (instead of simple words). In the past, handcrafted rules and lexicons were written to make a parser work. Later, statistical approaches proved far more efficient, both for transition-based and graph-based parsers. Recently, notably since the advent of word embeddings (like Word2Vec) and later context-aware word embeddings such as those obtained from language models like BERT, graph-based parsers have proved to be even better. All statistical approaches to dependency parsing need training data. The Universal Dependencies (UD) project provides the needed data in the form of 150 treebanks in over a hundred languages. Even though some treebanks are very small (for instance the Breton treebank Breton KEB), others are rich. In case of little or no treebank data, transfer learning on similar languages can be successful, notably with the UD data, since UD data has been annotated using a single set of guidelines for all languages. For instance, the sets of possible part-of-speech tags, dependency relations and morpho-syntactic features are defined universally. Most treebanks are monolingual, provided that expressions from other languages that can occur in the data, such as film titles or geographic names, are not counted as making it bi- or multilingual. In the real world, especially for speakers of Celtic languages, code switching is everywhere. We present a multilingual dependency parsing model (graph-based parser) which can parse any mixture of Welsh, Irish, Scottish Gaelic or Manx with English or French without losing much quality with respect to a monolingual model.

12h30-14h - lunch break

Lunch will be shared on site (plenty of vegetarian options).

14h00-14h40 - An Overview of Breton Audio Material on Cocoon

Tanguy Solliec, LACITO CNRS

30 min + 10 min questions, summary in French


Cocoon: A Repository for Speech Recordings; An Overview of Breton Language Material

Cocoon, short for COllections de COrpus Oraux Numériques, is a digital repository which supports researchers in building oral corpora and in archiving audio material collected during their research activities. This resource is developed by the CNRS as a part of the HUMA-NUM digital ecosystem. It contributes to the open science and open data movements and, more broadly, to the preservation of some aspects of intangible heritage. The audio or video files present on the Cocoon web platform are organized into different thematic collections. Several of these are dedicated to Breton and were produced during different dialectological fieldwork campaigns. Although other audio collections of Breton language recordings are available on other platforms, the Cocoon material is associated with systematic OLAC metadata. Even though the metadata associated with this material is generally well detailed, little documentation and few transcriptions are available or attached to these recordings, for a variety of reasons. The Cocoon repository provides raw material containing Breton data to various degrees. These research files are therefore heterogeneous, and a typology has to be developed to better describe their content, with a view to reuse for other purposes. In order to evaluate the material available and to identify the files best suited to further tasks, different criteria can be taken into account:

  • main language used in the recording (French with Breton words, interview conducted in Breton…)
  • quality of the recording
  • content of recordings
  • duration, possibility to cut into shorter chunks
  • sociolinguistic context and fluency of the speakers
  • number of speakers involved
  • presence of annotations and/or transcriptions
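
One possible way to make the criteria above operational is to record them for each file and combine some of them into a score. The Python sketch below is only an assumption about how this could look: the field names, the weights, the duration threshold and the identifiers are hypothetical, and none of them come from Cocoon or its OLAC metadata.

```python
from dataclasses import dataclass


@dataclass
class Recording:
    identifier: str          # hypothetical identifier, not a real Cocoon ID
    breton_ratio: float      # estimated share of speech in Breton, 0.0 to 1.0
    audio_quality: int       # 1 (poor) to 5 (good), judged by a listener
    duration_s: float        # length of the recording in seconds
    n_speakers: int          # kept for later filtering, unused in the score
    has_transcription: bool


def suitability(rec: Recording) -> float:
    """Combine three criteria into a score between 0 and 1 (arbitrary weights)."""
    score = 0.5 * rec.breton_ratio
    score += 0.3 * (rec.audio_quality - 1) / 4
    if rec.has_transcription:
        score += 0.2
    return score


def rank(recordings, min_duration_s: float = 0.0):
    """Drop recordings that are too short, then sort the rest by score."""
    kept = [r for r in recordings if r.duration_s >= min_duration_s]
    return sorted(kept, key=suitability, reverse=True)


if __name__ == "__main__":
    sample = [
        Recording("rec-A", 0.9, 4, 1800.0, 2, False),
        Recording("rec-B", 0.2, 5, 600.0, 1, True),
        Recording("rec-C", 1.0, 5, 20.0, 1, True),
    ]
    for rec in rank(sample, min_duration_s=60.0):
        print(rec.identifier, round(suitability(rec), 3))
```

A weighted sum is used only because it is easy to inspect; criteria such as sociolinguistic context or content would need a human-defined typology before they could enter such a score.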

The Cocoon repository focuses primarily on the conservation of, and access to, the material it stores. However, the platform does not allow this data to be enhanced with additional content afterwards. In this context, how could this “bad data” contribute to the development of language technology resources for a (relatively) low-resource language such as Breton? Initiatives like the ongoing research project DeepTypo (LLF, Paris), which aims at providing automatic transcriptions and extracting meaningful information from small corpora, offer interesting insights. In the case of the Cocoon material, the very first step is to link the published documents and the available digitized information to the recordings. The Nouvel Atlas Linguistique de la Basse Bretagne (Le Dû 2001) is a good illustration of possible steps forward. Given the uneven content of the recordings, a second line of work is to build a methodology to detect to what extent Breton is used in the recordings and to rank them accordingly. Different approaches provide Speech-to-Text solutions for automatic transcription. Testing them on fieldwork recordings will give a glimpse of their potential. It will also be an opportunity to inquire whether and how this raw data can contribute to their improvement.


  • Le Dû, Jean. 2001. Nouvel atlas linguistique de la Basse-Bretagne. 2 volumes. 601 maps, Centre de Recherche Bretonne et Celtique, Université de Bretagne Occidentale, Brest.

14h40-15h10 - Anaouder, Developing ASR tools for Breton

Gweltaz Duval-Guennoc, independent developer

30 min + 10 min questions, summary in French


Automatic Speech Recognition (ASR) systems can be invaluable resources for the Breton community, benefiting both learners and proficient speakers. Technologies like SMS dictation on smartphones or automatic captioning could potentially enhance exposure to the language by incorporating it into everyday handheld devices and equipping content creators with better tools.

Several notable initiatives to develop Speech-To-Text models for Breton have been made (Alan Entem, Holly Montalvo BlueRacoonTech, Xavier Marjou), but we are still far from having dependable and user-friendly software for end-users. One such initiative is Anaouder, which focuses particularly on creating on-device solutions to prioritize user privacy and autonomy.

We have been training models using the Kaldi framework with approximately 60 hours of transcribed audio data. These models are integrated into a Python module that offers command-line tools for real-time and continuous inference from a microphone or audio files, using Vosk as a backend. The code and models are available under an MIT open-source license on GitHub and PyPI.

In addition, we have developed a lightweight rule-based NLP toolkit to streamline textual and audio data processing. This toolkit includes features such as sentence segmentation, pre-tokenization, phonetization, text normalization, and inverse-normalization. Rule-based NLP tasks remain highly relevant for low-resource languages like Breton.
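
To give an idea of what such rule-based steps can look like, here is a minimal Python sketch of sentence segmentation, pre-tokenization, normalization and inverse normalization. It is not the toolkit's actual API: the function names, the regular expressions and the five-entry number table (masculine forms only) are assumptions chosen for illustration.

```python
import re

# Tiny illustrative table; a real toolkit needs many more rules
# (larger numbers, feminine forms, mutations after numerals, etc.).
DIGIT_TO_WORD = {"1": "unan", "2": "daou", "3": "tri", "4": "pevar", "5": "pemp"}
WORD_TO_DIGIT = {word: digit for digit, word in DIGIT_TO_WORD.items()}

# Split after ., ! or ? when followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# A word (apostrophes allowed inside, as in c'hwec'h) or one punctuation mark.
_TOKEN = re.compile(r"\w+(?:['’]\w+)*|[^\w\s]")


def split_sentences(text: str) -> list[str]:
    """Naive sentence segmentation on sentence-final punctuation."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


def pre_tokenize(sentence: str) -> list[str]:
    """Separate words from punctuation marks."""
    return _TOKEN.findall(sentence)


def normalize(tokens: list[str]) -> list[str]:
    """Lowercase words, spell out known digits, drop punctuation."""
    out = []
    for tok in tokens:
        if tok in DIGIT_TO_WORD:
            out.append(DIGIT_TO_WORD[tok])
        elif re.match(r"\w", tok):
            out.append(tok.lower())
    return out


def inverse_normalize(tokens: list[str]) -> list[str]:
    """Turn known number words back into digits, ignoring context."""
    return [WORD_TO_DIGIT.get(tok, tok) for tok in tokens]


if __name__ == "__main__":
    for sentence in split_sentences("3 levr am eus. Mat eo!"):
        words = normalize(pre_tokenize(sentence))
        print(words, inverse_normalize(words))
```

The inverse step deliberately ignores context, so a word like "unan" is always rewritten as a digit; handling such ambiguities is exactly where hand-written rules become necessary.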

Lastly, we present a desktop application in its early stages of development. This application aims to simplify the creation of transcripts in Breton while still being versatile enough to assist in data alignment and verification for the development of Breton text-audio corpora.

15h10-15h30 - coffee break

15h30-16h30 - Round table and discussion - Financing opportunities for research and development

30 min flash presentations of the different options, and 30 min fruitful discussion

Round table on funding possibilities in the UK (EPSRC) and France (ANR, CIFRE), with the participation of local actors such as the Public Office of the Breton Language (OPAB/Rannvro, pending acceptance of the invitation) and the endowment fund Breizh Niverel (David Lesvenan & David Le Meur, confirmed).