How to use this website

From Arbres
Revision of 3 February 2024 at 18:32 by Mjouitteau

This presentation is meant to present the Breton ARBRES website, to give you a global view of its scope, and to help you make full use of its functionalities. The website can be used in two ways:

  • for human readers, as a wikigrammar of Breton dialects
  • for automated processing, as a database of annotated sentences


Some numbers

The ARBRES website has been under development since 2007 and has been on-line since 2009.

It now receives more than 100 human visits per day (107 visitors per day on average over the last four months of 2023).

It offers, at the start of January 2024:

  • 10,238 pages, including 4,804 pages of content, 19 presentation pages, and a set of redirection pages
  • 3,094 articles on elements of Breton grammar
  • 325 theoretical explanation sheets

The website organizes a database of about 15,000 original Breton sentences, glossed and translated into French, drawn from:

  • 1,208 research works on the Breton language (books, dictionaries, research articles, data collection blogs)
  • 493 corpus references produced by native speakers (novels, newspaper articles, songs)
  • 44 elicitation sessions with native speakers, whose raw results are available on-line in the elicitation center

Goals of the ARBRES website

The ARBRES website aims to provide a comprehensive and detailed description of the language and its dialectal variation, from traditional dialects to Standard Breton. More generally, it aims to be a resource center for the formal and descriptive study of the language and for the development of automated processing tools for Breton. It is meant to be useful to different user profiles:

For speakers and experts of the language:

  • a pedagogical resource for people working with the Breton language
  • a collaborative social experiment organized around a highly endangered language
  • ready-to-share material, usable on social media, answering the classic linguistic questions that speakers ask about the language

For descriptive and theoretical linguists:

  • an accurate and theoretically informed description of Breton syntactic microvariation
  • a continuously updated state of the art of current linguistic research
  • a source of linguistic expertise on theoretical work on Breton
  • an active international research tool

For language technology developers:

  • a ready-to-use database of richly annotated data on a low-resource language
  • a directory of usable digital resources on Breton
  • a source of language expertise on available Breton material


Means to reach these goals

ARBRES is built incrementally, in close contact with the community of speakers.

The database uses wiki technology. It is fully open to collaborative writing and review. Each page is associated with a discussion page and a fully traceable history of contributions. The latest results of the scientific study of the Breton language, usually published in English, are reported on, analyzed and integrated into the wikigrammar in French, a language that is much more accessible to Breton speakers than English.

ARBRES also provides different accessibility tools in order to build bridges between these communities:

  • a full on-line grammar of Breton microdialectal variation, with a careful description of the dialectal and idiolectal facts, French translations of Breton data and clickable glosses.
  • two different tables of contents for the Breton grammar, each designed for a particular readership. The first, called Breton grammar, is designed for learners, speakers and philology teachers. The second, called Formal grammar, is designed for theoretically oriented linguists.
  • site-internal search facilities, such as a search toolbox covering the entire website (see top of page)
  • a glossary, in Breton, French or English, of more than 250 terms of formal grammar, each linked to definitions illustrated by Breton data.
  • a list of abbreviations, symbols and acronyms used in this field, with explanations
  • an elicitation center through which the international linguistic research community can co-build protocols with a Breton expert, who then conducts the elicitations and posts the protocol's raw results on-line.
  • an architecture of page categorisations
  • an up-to-date page on the history of Breton technology development, linked to an up-to-date summary of ready-to-use resources for developers.


ARBRES gives linguists the functionalities of a research notebook and of a scientific resource center.

  • raw data:
The elicitation center allows linguists to co-build elicitations, on demand, with the author and developer of the site. The results are posted on-line and integrated into the wikigrammar.

A Breton grammar on-line

ARBRES can be used like a regular grammar printed on paper, by browsing the table of contents, or by clicking directly on one of its major sections:

 1. Morphology 
 2. Constituents
 3. Syntax of the sentence
 4. Information structure
 5. Discourse


Like a regular printed grammar, it can also be browsed at random, by clicking on article au hasard [random article]. Unlike a regular printed grammar, it can be searched digitally, via the search box in the upper right corner of the screen, in either English or French.

Upon arriving at the article you want to consult, you will find first a brief summary with illustrative examples, followed by a table of contents. A fully developed article is structured as follows:

 1. Morphology
 1.1. Accentuation
 1.2. Consonant mutations
 1.3. Gender, number, person 
 2. Syntax
 2.1. Properties
 2.2. Distribution
 2.3. Associated elements
 3. Semantics 
 4. Diachrony 
 5. Typology
 6. Terminology
 7. Bibliography


The structure of an example in the grammar

For explanatory purposes, the examples below are glossed in English (in the grammar, glosses are in French). Every example is numbered. The first line is in Breton. Below it come the aligned glosses. Most glosses are active links, each leading to a full article with description and analysis. Glossing indicates partial morphological analysis, with affixes having their own active links. Superscript numbers in the glosses are also active links: they indicate the consonant mutations that affect initial consonants in Celtic languages (lenition is 1, the mixed set of mutations is 4, etc.). The next line provides a global translation of the Breton sentence.


(1) Da belec'h eta e fell d'it ez kasfen ?
to1 where then R4 pleases to.you you.OBJECT would.send
'Where then do you want me to send you?'
Léonard (Bodilis), Ar Floc'h (1922:347)


When available, an IPA transcription is also given, in green letters. Every example ends with the dialectal variety (in italics) and a reference to the source. The reference is an active link that leads to a separate reference page, in example (2) the page for the unpublished thesis of Erwan Evenou (1987).


(2) [ wa kOmâsǝd ǝ rEzistâs nEm fòrmo ]
Oa komañset ar Rezistañs en em furmañ. (Standard orthography)
was start.ed the "résistance" [ SC self form ]
'The resistance had started to structure itself.'
East-Kerne (Lanvenegen), Evenou (1987:627)
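
The layout described in this section can be mirrored in a small data structure. What follows is a minimal sketch in Python, not the storage format actually used by the site: the class name `GlossedExample`, its field names and the helper `split_mutation` are assumptions made for illustration. It aligns the Breton words of example (1) with their glosses and separates the mutation number from each gloss, assuming the numbers are written inline (`to1`, `R4`) as in a plain-text export.

```python
import re
from dataclasses import dataclass

# Mutation numbers named in the text above; other numbers are left unlabeled.
MUTATIONS = {1: "lenition", 4: "mixed"}


def split_mutation(gloss):
    """Split a trailing mutation number off a gloss: 'to1' -> ('to', 1)."""
    match = re.fullmatch(r"(.*?)(\d+)", gloss)
    if match:
        return match.group(1), int(match.group(2))
    return gloss, None


@dataclass
class GlossedExample:
    """Hypothetical container for one numbered example of the grammar."""
    breton: str        # first line, in its original spelling
    glosses: str       # word-by-word glosses, one per Breton word
    translation: str   # global translation of the sentence
    dialect: str       # dialectal variety
    source: str        # reference to the source

    def aligned(self):
        """Pair each Breton word with (gloss, mutation number or None)."""
        words = self.breton.rstrip(" ?.!").split()
        glosses = self.glosses.split()
        if len(words) != len(glosses):
            raise ValueError("words and glosses do not align")
        return [(w, *split_mutation(g)) for w, g in zip(words, glosses)]


example_1 = GlossedExample(
    breton="Da belec'h eta e fell d'it ez kasfen ?",
    glosses="to1 where then R4 pleases to.you you.OBJECT would.send",
    translation="Where then do you want me to send you?",
    dialect="Léonard (Bodilis)",
    source="Ar Floc'h (1922:347)",
)

for word, gloss, mutation in example_1.aligned():
    label = MUTATIONS.get(mutation, "") if mutation else ""
    print(f"{word:10} {gloss:12} {label}")
```

The alignment check fails loudly when the number of words and glosses differs, which is a cheap way to spot examples whose gloss line was damaged during extraction.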

Reference pages

Each example is thus linked to its precise source. Sources are active links toward a dedicated reference page. Each reference is associated with (when available):

  • complete bibliographical information
  • an active link toward the URL address of the reference
  • identification of the dialect(s) used or discussed
  • an abstract or summary
  • the publication history
  • reviews
  • erratum list
  • active links toward extracts of the reference (see for example: Kervella (1947))

A useful feature allows you to see exactly where a given reference has been cited within the website. In the reference article, go to the Outils [tools] box and click on pages liées [linked pages]. You will get the list of pages linking to the reference.
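
The same list can in principle be requested programmatically. The sketch below assumes that the site runs MediaWiki with its standard Action API enabled, which is not confirmed here. The endpoint `https://example.org/api.php` is a placeholder, and the code only builds the query URL without sending any request. `list=backlinks`, `bltitle` and `bllimit` are the standard MediaWiki parameters for listing the pages that link to a given title.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the real API address of the site is an assumption.
API_ENDPOINT = "https://example.org/api.php"


def backlinks_url(title, limit=50, endpoint=API_ENDPOINT):
    """Build a MediaWiki Action API URL listing pages that link to `title`."""
    params = {
        "action": "query",
        "list": "backlinks",
        "bltitle": title,
        "bllimit": limit,
        "format": "json",
    }
    return f"{endpoint}?{urlencode(params)}"


# Reference page mentioned in the list above.
url = backlinks_url("Kervella (1947)")
print(url)
```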

Explore a topic

After reading a given article, there are several options to explore further.


  • Try the clickable links inside the article to explore associated pages.
  • Explore the bibliography:
If an article, a thesis or a book has been written on the topic, it should appear at the bottom of the page, in its dedicated bibliography. If the work is not (yet) mentioned there, go to the general bibliography page in the Resource Center.
  • Find other pages of the same category:
Each article belongs to one or more categories. For example, the article on kalz, 'lot (of)', belongs to the categories determiner, adverb, quantifier and indefinite. These are listed with links at the very bottom of the page.
Clicking on a given category gives you the list of all the pages from the same category.
  • Leave comments or questions on the "discussion" page associated with each article:
I answer comments and questions in a timely manner. I update the website accordingly, provide explanations, or develop new tools.
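
The category mechanism mentioned in the list above amounts to an inverted index from categories to pages. The following is a minimal sketch, assuming a simple page-to-categories mapping; the names `PAGE_CATEGORIES` and `build_category_index` are hypothetical, and the two entries are the only ones stated on this page (kalz here, krokodil in the section on lemmas).

```python
from collections import defaultdict

# Page -> categories, limited to what this presentation states explicitly.
PAGE_CATEGORIES = {
    "kalz": ["determiner", "adverb", "quantifier", "indefinite"],
    "krokodil": ["noun"],
}


def build_category_index(page_categories):
    """Invert page -> categories into category -> sorted list of pages."""
    index = defaultdict(list)
    for page, categories in page_categories.items():
        for category in categories:
            index[category].append(page)
    return {category: sorted(pages) for category, pages in index.items()}


index = build_category_index(PAGE_CATEGORIES)
print(index["adverb"])  # pages listed when clicking on the category "adverb"
```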

How to use the Resource Center

The Resource Center is meant to provide all types of information for research. It is always accessible through the left panel on the screen.

It provides the classical tools of printed grammars, and some less classical ones deriving from the choice of the digital medium and of the Open Science research paradigm.


Classic Tools

  • The bibliography, intended as a complete, up-to-date guide to the descriptive and scientific works on the syntax and morphosyntax of Breton
Whenever possible, active links are provided to documents available on-line. The references used on the website are visible at a glance (they are active links toward the reference page).
Specialized journals such as Hor Yezh or La Bretagne Linguistique have a dedicated page listing their contents.
  • A glossary of more than 250 technical terms used in formal grammar, and its version in French
In order to find the French translation of a term, follow the links in the English glossary - you will arrive at an article whose title is the French translation or appears in bold at the beginning of the article. For translation from French to English or Breton, search for the terminologie [terminology] subpart of the articles.
  • A list of technical abbreviations, as well as traditional acronyms for corpora and glosses.

What's new?

This website is constantly evolving, and can be used to share news about the study of Breton, whether internal or external to the website.

Here I announce calls for papers, conferences, important publications, news about the Breton language, and the latest work on the website.
  • It is also possible to follow the history of one page in particular, or check the latest modifications on the website under Modifications Récentes.
  • A page of useful external links:
In English, Breton and French, there are links for exploring the study of Breton and electronic resources on Breton, the Celtic languages, and minority languages in general.

This is a participative website: you can post information on the news page, add new references to the bibliography, or enrich the links pages.


Find out more about a particular dialect

The tools of the Resource Center are meant to be of help.

  • Find corpora or references
The dialects of the works in the general bibliography have been located on a Google map. This lets you visualize the geographical distribution of works on different varieties of Breton.
Does a grammar report a particular feature for a dialect, and you want to check the facts? The map helps you find references for the corpora and grammars closest to that place.
This list is far from exhaustive, but it allows you to choose an author or work according to the dialect of interest.
  • Consult the list of different usable corpora in the Resource Center.
It provides references for audio transcripts and for written, glossed, and IPA-transcribed corpora, etc.

Learning

The system of clickable glosses makes ARBRES a powerful tool for learning the language. One can surf from page to page, reading the Breton sentences and clicking on any item of a sentence one wants to learn more about.

One can search for a lexical item in any of the different orthographic systems, or under a mutated form, and still find the page.


Wikigrammars and language resource building

A wikigrammar involves a wiki-based platform that includes a descriptive and formal grammar of the language and is open to contributions and discussions. The examples in the grammar are annotated. They are automatically retrievable, and as such constitute a digital database of the language.

Wikigrammars provide language technology development with a very specific type of resource: a corpus that is by definition a concentrate of linguistic diversity.

Linguistically diverse by design

The data from the ARBRES wikigrammar were collected by a linguistics researcher. These data were gathered to underpin fundamental research in formal linguistics. In this sense, the data are those of a research notebook. Subsequently, the data were organized and significantly expanded with the objective of creating a descriptive grammar, accessible in its online and written form to the speaking community. The goal, therefore, was twofold: to produce a comprehensive description of a natural language, capturing its diversity, complexity, and regularities, and to provide new data that are pertinent to ongoing debates in fundamental generative linguistics research.

The corpus contains the somewhat artificial sentences typical of grammars, but they are outweighed by more natural ones with varied information structure.

For copyright reasons, the author could only take a modest percentage of the sentences for each published corpus. The effect is a widening of the variety of sources (newspaper articles, novels, songs, poems, collections of popular expressions, political leaflets, town hall presentation sites, posts on social networks, etc.).

The corpus includes elicitation data, a result of fieldwork conducted for linguistic description. The linguist has submitted to native speakers a protocol of questions, translations, image description tasks, or grammaticality judgments on sentences presented to them. Copyright on these sources is respected because the speakers give informed consent to the dissemination of the results of the surveys and, where applicable, to the on-line distribution of their voice.

dialectal diversity

ARBRES is a grammar of dialects, and its corpus is dialectally diverse by design. It is a descriptive grammar, not a prescriptive one: Standard Breton is treated as one dialect among the others, so the dialectal spectrum is quite broad. The Gwenedeg dialect, however, is under-represented. It is linguistically the furthest from the other dialects, and its analysis requires expertise that the main editor sometimes lacks; as a result, fewer data represent it. Apart from this deficit, rare dialectal facts can be considered quantitatively over-represented in the data. Facts that are common across the language are illustrated once for each major dialect, but not beyond. By contrast, to describe a rare fact precisely, along with its dialectal distribution and the contexts in which it appears, examples are provided for every attested occurrence. Rare facts are also more likely to be the subject of thematic elicitation sessions, which yield more data where they occur. For the same purpose of describing variation, forms from different styles coexist within the corpus, and this variation is quantitatively over-represented compared to any single corpus. In this sense, the ARBRES corpus is not well suited to quantitative studies, but it offers a concentrate of grammatical diversity for training automatic systems.

diversity of orthographies

The presence of written corpus data from the 20th century means, in the case of Breton, the presence of several competing orthographic systems. The source data have not been altered, and examples appear in their original printed spellings.

minimal pairs and negative evidence

The presence of elicitation data means the presence of data tagged as ungrammatical. To ensure that a precise fact is the key to the acceptability of a form, formal linguists establish minimal pairs. The two members of a pair differ in only one feature: one is grammatical, the other ungrammatical. To understand dialectal differences, it is also important to know how far in space a given form will be understood or accepted. Beyond the dialect boundary of a given linguistic fact, speakers report its forms as ungrammatical. We then obtain minimal pairs {dialect A, OK form / dialect B, ungrammatical form}. The minimal pairs contained in the wikigrammar can be extracted to form either training sets or model evaluation sets.
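
One common way to use such pairs for evaluation is to check whether a model scores the grammatical member higher than the ungrammatical one. The sketch below illustrates that idea under stated assumptions and is not a procedure described by the site: the names `MinimalPair`, `differs_in_one_token` and `pair_accuracy` are hypothetical, the tokens are placeholders rather than Breton data, and `toy_score` stands in for a real language model. The one-token check is a simplification of "one feature", and it does not apply to cross-dialect pairs, where the same form is judged differently in two dialects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MinimalPair:
    """Two forms differing in one feature; only the first is accepted."""
    grammatical: str
    ungrammatical: str


def differs_in_one_token(pair):
    """True when both forms have the same length and exactly one differing token."""
    a, b = pair.grammatical.split(), pair.ungrammatical.split()
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def pair_accuracy(pairs, score):
    """Share of pairs where `score` rates the grammatical form strictly higher."""
    if not pairs:
        raise ValueError("no pairs to evaluate")
    hits = sum(score(p.grammatical) > score(p.ungrammatical) for p in pairs)
    return hits / len(pairs)


# Placeholder tokens, not Breton data.
pairs = [
    MinimalPair("w1 w2 w3", "w1 X w3"),
    MinimalPair("w4 w5", "w4 Y"),
]


def toy_score(sentence):
    """Stand-in for a model score: penalizes upper-case tokens."""
    return -sum(token.isupper() for token in sentence.split())


print(all(differs_in_one_token(p) for p in pairs))
print(pair_accuracy(pairs, toy_score))
```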

NLP uses of the corpus


corpus enrichment

The data from the ARBRES wikigrammar were collected and annotated by a linguistics researcher.

token to lemma

In wikitables, each form is connected to its equivalent in standard spelling. Each word of a sentence is glossed (translated as if found in isolation). This gloss is clickable for the user of the interface, and its redirect address is the spelling of its standard form. The multiplicity of spellings, combined with the systematic linking of each occurrence to its standard lemma, allows for high diversity in the data without harming their consistency. The same system that redirects tokens toward their respective lemmas also makes it possible to connect the various word forms. This is key in a Celtic language, which shows not only inflection by suffixation but also modifications of the initial consonant depending on the syntactic context (consonant mutations). The lemma krokodil can thus be automatically linked to its occurrences in krokodil Maia 'the crocodile of Maia', ar c'hrokodil 'the crocodile', ar c'hrokodiled 'the crocodiles', war grokodileta 'about to look for crocodiles'. In the wiki, all these occurrences point to the same page dedicated to the lemma krokodil. Since this page is categorized as a noun page, its grammatical category is also automatically recoverable. For a detailed description of the recoverable grammatical annotations, see Jouitteau & Bideault (2023) and the details of the data extraction project by AUTOGRAMM Breton treebank II.
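
The redirect mechanism can be pictured as a lookup table from word forms to lemma pages. The following is a minimal sketch, assuming the redirects have already been extracted into a dictionary; the names `REDIRECTS`, `LEMMA_CATEGORY`, `lemmatize` and `forms_by_lemma` are hypothetical, and the entries are limited to the krokodil forms cited above.

```python
from collections import defaultdict

# Form -> lemma redirects, taken from the krokodil examples in the text.
REDIRECTS = {
    "krokodil": "krokodil",
    "c'hrokodil": "krokodil",
    "c'hrokodiled": "krokodil",
    "grokodileta": "krokodil",
}

# Lemma page -> grammatical category, read off the page's categorization.
LEMMA_CATEGORY = {"krokodil": "noun"}


def lemmatize(tokens):
    """Return (token, lemma, category) triples; unknown tokens map to None."""
    result = []
    for token in tokens:
        lemma = REDIRECTS.get(token.lower())
        result.append((token, lemma, LEMMA_CATEGORY.get(lemma)))
    return result


def forms_by_lemma(redirects):
    """Group all attested forms under the lemma they redirect to."""
    groups = defaultdict(set)
    for form, lemma in redirects.items():
        groups[lemma].add(form)
    return dict(groups)


print(lemmatize("ar c'hrokodiled".split()))
print(sorted(forms_by_lemma(REDIRECTS)["krokodil"]))
```

A table lookup is used here rather than rules that undo mutations, because the text describes the link between form and lemma as an explicit redirect recorded for each occurrence.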

cost estimates

Wikigrammars are an expensive way to gather data. They require one or more people trained in the language, with at least some dialectal flexibility and a social network broad enough to reach speakers of different linguistic profiles, as well as ways for those speakers to find a non-monetary benefit in taking part in a linguistic protocol. The work also involves long hours of coding the examples and presenting them adequately in the grammar for a human readership. It requires technical support for the design, general maintenance and updating of the site, and technical monitoring of its on-screen accessibility for various users. However, all of these resources exist outside the scope of NLP. At the community level, investment may be driven entirely by internal goals. The database incrementally builds an educational and/or scientific resource in a form adapted to its audience. On the scale of small language communities, this avoids monopolizing experts to create databases that the general public could not use. The development of wikigrammars is particularly recommended for building pilot-project resources on languages with limited corpora: if the IT field fails to deliver finished tools for speakers, the investment still benefits the speaking community, which can continue to improve the resource for itself. In terms of human resources, descriptive and formal linguists set themselves the task of producing language analysis material. They are generally few in number for languages with limited corpora, but they are often strongly committed to their empirical domain and to the speakers who produce it, and have detailed cultural knowledge of how to interact with them. The wiki solution, for its part, is designed for large-scale collaboration among potentially isolated contributors, which makes it particularly suitable for minoritized languages.

Social engagement in minority languages

public engagement

Internal statistical tools, and especially Google Analytics, give a rather precise statistical picture of the way this website is used. The flow of anonymized data, more than 100 human visits a day, can be analyzed with some precision. One can see the favorite entry pages, the pages that receive the least engagement or the shortest reading times, and the search-engine queries that led readers to the grammar. Once a website is well indexed by search engines and has reached a critical size, the geographical origin of connections is also telling. The Breton wikigrammar is mostly used in Brittany and in the places of the diaspora. Its heaviest traffic follows the university calendar. The pages covering the more technical material, which present basic formal linguistics in French, show high peaks at typical exam periods in French-speaking countries and regions (Switzerland, Morocco, Québec, Belgium, etc.). The data are precise enough to reveal use of the resource at an international scale when Breton classes open sporadically, as in 2010 when Anna Mouradova started teaching Breton in Moscow. Conversely, one can see when the resource is not used (sporadic Breton classes at Harvard).

equip the interface between science and society

The Breton wikigrammar ARBRES is an experiment in open and participative science (see Jouitteau 2013b for an analysis of the early deployment). Wikigrammars bring the scientific process closer to the public. Like any other open-access grammar, a wikigrammar makes the results of research available as they stand at a given time. But it does much more. In synchrony, it links the work to the sources used and to the scientific community. It also sheds light on the past of its making, and on the future of its making. We now illustrate these three dimensions.

Scientific monitoring makes it possible to feed the grammar with the results of the latest research. This effect derives solely from its use as a research notebook. The external resources are summarized, referenced and, when open access allows, directly linked to. All of these operations bring the readership closer to scientific stakeholders, and make them more understandable and more accessible. Occasionally, this invisible barrier can also simply fall away. In 2014, the organizers of the Redadeg [a race for the Breton language] asked for translations of I speak Breton, and you? into different languages. Within a few days, linguists from all over the world happily contributed to the page I speak Breton, what about you?, bringing together translations of this sentence in 77 different languages. In support of the event, 1,695 Breton speakers posted self-portraits on-line with these sentences. The international community of linguists became visible to the speaker community, and conversely the Breton language appeared very concretely to the scientists as the production of living speakers.

A wikigrammar contains the past of its making. It documents the making of its own research. The wiki "history" function offers, on each page, full traceability of the process of knowledge building and data gathering: contributions, corrections, discussions, exploration of new datasets, integration of new bibliographic sources, and emerging hypotheses being tested. Each page is associated with a complete history listing all modifications made to it since its creation. One can trace how science is done, and how new data and new publications change our hypotheses. The diversity of contributors, or the lack thereof, is visible. Every contribution is visible and can be duly credited.

Scientific research is the result of a methodology, and is at heart a process accessible to anyone, as long as the methodology is respected. Within these limits, the wiki software is designed to allow for both cumulative collaboration (massive aggregation of small contributions into a single architecture) and distributive collaboration (with differentiated tasks). Various skills can then come together to build a strong resource for the community. This medium raises for readers the question of their place in the process, allowing for a gradation of postures from passive to active (reading, commenting, correcting, providing input, writing, linking, etc.). This is particularly welcome in the case of minority languages, where speakers commonly report feeling dispossessed of what they consider their language.

Finally, let us point out a marginal but beneficial effect. Society is rife with sometimes under-informed debates about minority languages, due to a lack of verifiable information, a lack of knowledge of real linguistic varieties, or an accumulation of inaccuracies. The site develops linguistic discussion articles that provide concrete elements of analysis for these debates. The digital format of these articles makes them directly shareable on social networks, in a format open to scientific discussion, within the limits of scientific argumentation.

The object of science should never be reduced to civic work, because science has its own legitimate internal goals. However, when science can take on this civic dimension while pursuing its own goals, and when science needs data produced by this society, why miss this opportunity?

Bibliography

  • Jouitteau, Mélanie & Reun Bideault. 2023. 'Outils numériques et traitement automatique du breton', in Annie Rialland & Michela Russo (eds.), Langues régionales de France: nouvelles approches, nouvelles méthodologies, revitalisation, Éditions de la Société de Linguistique de Paris, 37-74.