2024 Workshop on Breton Language Technologies
The CNRS laboratory IKER is organizing the 2024 Workshop on Breton Language Technologies, which will take place at the University of Quimper (Brittany) on June 8.
Contacts:
- Mélanie Jouitteau: melanie.jouitteau at iker.cnrs.fr
- Milan Rezac: milan.rezac at iker.cnrs.fr
The aim of this workshop is to facilitate a meeting of minds between linguists and developers of technologies for Breton and Brittonic languages. Our objective is to foster a deeper understanding of each other's achievements and to build our collective capacity in this field.
The date has been chosen to fall between the Celtic Student Conference in Brest from May 30 to June 1, and the CRBC Breton Summer School in English beginning on June 10 in Quimper.
It will be possible to follow the event online.
We are in the process of putting together a comprehensive program that includes various talks and thematic sessions.
Confirmed speakers and attendees so far:
- Liana Ermakova (UBO), Loic Grobol (U. Paris Nanterre), Johannes Heinecke (Orange), Mélanie Jouitteau (IKER, CNRS), Gweltaz Duval-Guennoc (independent), Alan Entem (independent), Tanguy Solliec (LACITO, CNRS)
UD and dependency parsing
Johannes Heinecke
Multilingual Dependency Parsing for Celtic languages and their neighbouring languages
Dependency parsing is a typical NLP task that takes plain sentences as input and produces dependency syntax trees as output. We currently deploy dependency parsing in a tool chain that preprocesses customer and employee comments on products and services in order to classify them thematically. POS tagging and dependency parsing are used to identify "who did what" easily and to build nominal groups as keywords (instead of single words). In the past, handcrafted rules and lexicons were written to make a parser work. Later, statistical approaches proved far more efficient, both for transition-based and graph-based parsers. More recently, notably since the advent of word embeddings (such as Word2Vec) and later of context-aware word embeddings obtained from language models like BERT, graph-based parsers have proved to be even better. All statistical approaches to dependency parsing need training data. The Universal Dependencies (UD) project provides the needed data in the form of 150 treebanks in over a hundred languages. Even though some treebanks are very small (for instance the Breton treebank Breton KEB), others are rich. When there is little or no treebank data, transfer learning from similar languages can be successful, notably with UD data, because UD data has been annotated using a single set of guidelines for all languages. For instance, the sets of possible part-of-speech tags, dependency relations and morpho-syntactic features are defined universally. Most treebanks are monolingual, provided that expressions from other languages that can occur in the data, such as film titles or geographic names, are not counted as making a treebank bi- or multilingual. In the real world, especially for speakers of Celtic languages, code-switching is everywhere. We present a multilingual dependency parsing model (a graph-based parser) that can parse any mixture of Welsh, Irish, Scottish Gaelic and Manx with English or French without losing much quality compared with a monolingual model.
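To make the "who did what" idea concrete, here is a minimal Python sketch that reads one sentence in the tab-separated CoNLL-U format used by UD treebanks and pulls out subject-verb-object triples and a nominal group. It is an illustration only, not the tool chain described in the abstract: the example sentence and its annotation are hand-written, and the helper names (`read_sentence`, `who_did_what`, `nominal_group`) are hypothetical.

```python
from typing import NamedTuple


class Token(NamedTuple):
    id: int
    form: str
    lemma: str
    upos: str
    head: int      # id of the syntactic head, 0 for the root
    deprel: str    # UD dependency relation to the head


def _row(i, form, lemma, upos, head, deprel):
    # A CoNLL-U token line has 10 tab-separated columns; unused ones are "_".
    return "\t".join([str(i), form, lemma, upos, "_", "_", str(head), deprel, "_", "_"])


# Hand-written illustrative annotation, not taken from any treebank.
SAMPLE = "\n".join([
    "# text = The customer praised the new service.",
    _row(1, "The", "the", "DET", 2, "det"),
    _row(2, "customer", "customer", "NOUN", 3, "nsubj"),
    _row(3, "praised", "praise", "VERB", 0, "root"),
    _row(4, "the", "the", "DET", 6, "det"),
    _row(5, "new", "new", "ADJ", 6, "amod"),
    _row(6, "service", "service", "NOUN", 3, "obj"),
    _row(7, ".", ".", "PUNCT", 3, "punct"),
])


def read_sentence(block):
    """Parse one CoNLL-U sentence block into a list of Token tuples."""
    tokens = []
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword ranges (e.g. 1-2) and empty nodes (e.g. 1.1)
        tokens.append(Token(int(cols[0]), cols[1], cols[2], cols[3], int(cols[6]), cols[7]))
    return tokens


def who_did_what(tokens):
    """Return (subject, verb, object) lemma triples for each verb."""
    triples = []
    for verb in tokens:
        if verb.upos != "VERB":
            continue
        subjects = [t.lemma for t in tokens if t.head == verb.id and t.deprel == "nsubj"]
        objects = [t.lemma for t in tokens if t.head == verb.id and t.deprel == "obj"]
        triples.extend((s, verb.lemma, o) for s in subjects for o in objects)
    return triples


def nominal_group(tokens, head_id):
    """Head noun plus its direct amod/compound dependents, in surface order."""
    keep = [t for t in tokens
            if t.id == head_id or (t.head == head_id and t.deprel in {"amod", "compound"})]
    return " ".join(t.form for t in sorted(keep, key=lambda t: t.id))


tokens = read_sentence(SAMPLE)
print(who_did_what(tokens))       # [('customer', 'praise', 'service')]
print(nominal_group(tokens, 6))   # new service
```

Because the tag set and relation labels (`nsubj`, `obj`, `amod`) are shared across UD treebanks, the same extraction logic would in principle apply unchanged to parser output for Breton, Welsh or any other UD language.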