How to use this website

From ARBRES
Version of 31 December 2023, 15:28

French version

This presentation gives you an overview of the scope of the Breton ARBRES website and helps you make full use of its features. The website is designed for two kinds of use:

  • for human readers, as a wikigrammar of Breton dialects
  • for automated processing, as a database of annotated sentences


Some numbers

The ARBRES website has been under development since 2007 and went online in 2009. Over the last four months of 2023, it received more than 107 human visits per day. At the start of January 2024, it offers:

  • 10,238 pages, including 4,804 pages of content, 19 pages of presentation, and a battery of redirect pages
  • 3,094 articles on elements of Breton grammar
  • 325 theoretical explanation sheets

The website organizes a database of about 15,000 original Breton sentences, glossed and translated into French, drawn from:

  • 1,208 research works on the Breton language (books, dictionaries, research articles, data collection blogs)
  • 493 corpus references produced by native speakers (novels, newspaper articles, songs)
  • 44 elicitation sessions with native speakers whose raw results are available online in the elicitation center

Goals of the ARBRES website

The ARBRES website aims to be:

For speakers and experts of the language:

  • a pedagogical resource for people who work with the Breton language
  • a collaborative social experiment organized around a highly endangered language
  • a source of ready-to-use material for social media, addressing the classic questions speakers have about the language

For descriptive and theoretical linguists:

  • an accurate and theoretically informed description of syntactic microvariation
  • a permanently up-to-date state of the art of current linguistic research
  • a source of linguistic expertise on the theoretical productions on Breton
  • an active international research tool

For language technology developers:

  • a ready-to-use database of richly annotated data on a low-resource language
  • a guide to usable digital resources on Breton
  • a source of language expertise on available Breton material


Means to reach the goals

To reach these goals, ARBRES provides:

  • a full on-line grammar of the Breton microdialectal variation, with a careful description of the dialectal and idiolectal facts, French translations of Breton data and clickable glosses.
  • two different indexes of the Breton grammar, each designed for a particular readership. The first, called Breton grammar, is designed for learners, speakers and philology teachers. The second, called Formal grammar, is designed for theoretically oriented linguists.
  • a research toolbox quantifying over the entire website
  • a glossary, in Breton, French or English, of more than 250 terms of formal grammar, each linked to a definition illustrated by Breton data.
  • a list of abbreviations, symbols and acronyms used in this field, with explanations
  • an elicitation center through which the international syntactic research community can co-build protocols with a Breton expert, who then conducts the elicitations and posts the protocol's raw results online.
  • an up-to-date general bibliography of Breton linguistics.
  • an architecture of page categorisations
  • an up-to-date page on the history of Breton technology development, linked to an up-to-date summary of ready-to-use resources for developers.


The database uses wiki technology and is built incrementally in collaboration with the community of speakers. It is fully open to collaborative writing and review. Each page is associated with a discussion page and a fully traceable history of contributions.

Speakers and language professionals have access to a comprehensive and detailed description of the language and its dialectal variations, from traditional dialects to Standard Breton, as well as a resource center for the formal and descriptive study of the language. The latest results of the scientific study of the Breton language (usually in English) are reported, analyzed and integrated into the wikigrammar (in French, a language accessible to Breton speakers). An elicitation center allows linguists to co-build elicitations with the author and developer on demand. The results are posted online and integrated into the wikigrammar.

The Breton grammar on-line

This website contains a full descriptive and formal grammar of the Breton language in all its dialectal varieties.

It can be used like a regular grammar printed on paper, by browsing the table of contents, or by clicking directly on one of its major sections:

 1. Morphology 
 2. Constituents
 3. Syntax of the sentence
 4. Information structure
 5. Discourse

These five major sections remain accessible throughout your navigation, in the left panel of the screen, under Grammaire du breton [Breton grammar].

Like a regular printed grammar, it can also be browsed at random, by clicking on article au hasard [random article].

Unlike a regular printed grammar, it can be searched digitally, via the search box in the upper right corner of the screen, in either English or French.

Upon arriving at the article you want to consult, you will find first a brief summary with illustrative examples, followed by a table of contents. A typical article looks as follows:

 1. Morphology
 1.1. Accentuation
 1.2. Consonant mutations
 1.3. Gender, number, person 
 2. Syntax
 2.1. Properties
 2.2. Distribution
 2.3. Associated elements
 3. Semantics 
 4. Diachrony 
 5. Typology
 6. Terminology
 7. Bibliography


Examples

Every example is numbered. The first line is in Breton. Below it come glosses in French. Most glosses are active links, each leading to a full article with description and analysis. Glossing indicates partial morphological analysis, with affixes having their own active links. Superscript numbers in the glosses are also active links: they indicate the different consonant mutations (1 for lenition) that affect the initial consonants of Breton words. The gloss line also indicates partial syntactic analysis, such as constituency.


For explanatory purposes, the examples here are glossed in English (in the grammar they would be in French):


(1) Sevel a reas ar paotr e zaoulagad …
raise.INF R did the boy [VP _ his1 2.eye ]
'The boy raised his eyes.'
Standard Breton, Drezen (1990:23)


The example ends with a global translation for the Breton sentence. Sometimes, when relevant, an alternative translation in the dialectal French of Low Brittany is provided above the translation in standard French.


(2) Me am-eus c'hoant da lavared penaoz ema ar wirionez gant ar skolaer !
me R.1SG has impulse to1 say how is the 1truth with the school.er
'I am inclined to say that the school teacher is right.' (except in French v)
'Moi, je prétends que l'instituteur a raison.'
Treger Breton, Gros (1984:176)


When available, an IPA transcription is also given, in green letters.


(3) [ wa kOmâsǝd ǝ rEzistâs nEm fòrmo ]
Oa komañset ar Rezistañs en em furmañ.
was start.ed the "résistance" [ SC self form ]
'The resistance had started to structure itself.'
East-Kerne (Lanvenegen), Evenou (1987:627)


All examples end by indicating the dialectal variety (in italics), and a reference indicating the source. The reference is an active link, and directs to a separate bibliographic page, here the unpublished thesis of Evenou (1987).
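
For automated processing, the layout described above can be read into a structured record. The following is a minimal sketch, assuming the simplest four-line layout of example (1) (numbered Breton line, gloss line, translation, then variety and source); the names `GlossedExample` and `parse_example` are illustrative and are not part of the website. Examples with an IPA line or a second translation would need extra handling.

```python
import re
from dataclasses import dataclass


@dataclass
class GlossedExample:
    number: int
    breton: str
    gloss: list[str]
    translation: str
    variety: str
    source: str


def parse_example(block: str) -> GlossedExample:
    """Parse a four-line example: text, gloss, translation, variety + source."""
    lines = [line.strip() for line in block.strip().splitlines()]
    if len(lines) != 4:
        raise ValueError("expected exactly four lines")
    head = re.fullmatch(r"\((\d+)\)\s+(.+)", lines[0])
    if head is None:
        raise ValueError("first line must start with a number in parentheses")
    # The variety may contain commas or parentheses, so split on the last comma.
    variety, source = lines[3].rsplit(", ", 1)
    return GlossedExample(
        number=int(head.group(1)),
        breton=head.group(2),
        gloss=lines[1].split(),  # brackets and gap symbols are kept as tokens
        translation=lines[2].strip("'"),
        variety=variety,
        source=source,
    )


EXAMPLE_1 = """
(1) Sevel a reas ar paotr e zaoulagad …
raise.INF R did the boy [VP _ his1 2.eye ]
'The boy raised his eyes.'
Standard Breton, Drezen (1990:23)
"""

example = parse_example(EXAMPLE_1)
print(example.variety, "|", example.source)
```

The gloss tokens are deliberately left unaligned with the Breton words: bracket labels and gap symbols in the gloss line mean that the two lines do not always have the same number of tokens.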


Reference pages

Each example is linked to its precise source. Sources are active links to a dedicated reference page. ARBRES contains:

  • more than 160 reference articles for corpora of written and spoken Breton
  • more than 250 reference articles for research works (books and articles) about the language

Each reference is associated with (when available):

  • complete bibliographical information
  • an active link toward the URL address of the reference
  • the publication history
  • an abstract or summary
  • reviews
  • identification of the dialect(s) used or discussed
  • erratum list
  • active links toward extracts of the reference (see for example: Kervella (1947))

A useful feature allows you to see exactly where a given reference has been cited within the website. In the reference article, go to the Outils [tools] box and click on pages liées [linked pages]. You will get the list of items linking to the reference.
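
The same list of linking pages can in principle be requested from a script, since MediaWiki wikis expose a web API. The sketch below only builds the request URL for the standard `list=backlinks` query. The endpoint address is a placeholder, and whether ARBRES enables this API is an assumption that this page does not confirm.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: replace it with the real api.php address of the wiki.
API_URL = "https://example.org/api.php"


def backlinks_url(title: str, limit: int = 50) -> str:
    """Build a MediaWiki API URL listing the pages that link to `title`."""
    params = {
        "action": "query",
        "list": "backlinks",
        "bltitle": title,   # title of the reference page
        "bllimit": limit,   # maximum number of linking pages to return
        "format": "json",
    }
    return f"{API_URL}?{urlencode(params)}"


print(backlinks_url("Kervella (1947)"))
```

Fetching the URL and reading the returned JSON is left out on purpose, so that the sketch runs without network access.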


Explore a topic

After reading a given article, there are several options to explore further.


  • Try the clickable links inside the article to explore associated pages.
  • Explore the bibliography:
If an article, a thesis or a book has been written on the topic, it should appear at the bottom of the page, under "bibliographie". If the work is not (yet) mentioned there, go to the general bibliography page in the Resource Center.
  • Find other pages of the same category:
Each article belongs to one or more categories. For example, the article on kalz, 'lot (of)', belongs to the categories determiner, adverb, quantifier and indefinite. These are listed with links at the very bottom of the page.
Clicking on a given category gives you the list of all the pages in that category.
  • Leave comments or questions on the "discussion" page associated with each article:
I answer comments and questions in a timely manner. I update the website accordingly, provide explanations, or develop new tools.
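
The category navigation described in this list amounts to an inverted index from categories to pages. Here is a minimal sketch of that idea, assuming page-to-category data has already been collected; only the kalz entry comes from this page, and the second page is a made-up placeholder.

```python
from collections import defaultdict

# Page -> categories. The "kalz" entry is taken from the text; the second
# page is a hypothetical placeholder that only serves to share a category.
PAGE_CATEGORIES = {
    "kalz": ["determiner", "adverb", "quantifier", "indefinite"],
    "hypothetical page": ["adverb"],
}


def pages_by_category(page_categories):
    """Invert a page -> categories mapping into category -> sorted pages."""
    index = defaultdict(list)
    for page, categories in page_categories.items():
        for category in categories:
            index[category].append(page)
    return {category: sorted(pages) for category, pages in index.items()}


INDEX = pages_by_category(PAGE_CATEGORIES)
print(INDEX["adverb"])
```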


Cite a page of this website

I recommend the following format for citing this work:

  • Jouitteau, Mélanie. (éd.). 2009-2023. 'title of the article', In ARBRES, wikigrammaire des dialectes du breton et centre de ressources pour son étude linguistique formelle, URL of the article, [date of access].


and if you wish to make reference to the entire ARBRES grammar:

How to use the Resource Center

The Resource Center is meant to provide all types of information for research. It is always accessible through the left panel of the screen.

It provides the classical tools of printed grammars, and some less classical ones deriving from the choice of the digital medium and of the Open Science research paradigm.


Classical Tools

  • The bibliography is intended as a complete, up-to-date guide to the descriptive and scientific works on the syntax and morphosyntax of Breton. Whenever possible, active links are made available to documents on-line. The references used on the website are visible at a glance (they are active links toward the reference page).
Specialized journals like Hor Yezh or La Bretagne Linguistique have a dedicated page listing their contents.
  • A glossary of more than 250 technical terms used in formal grammar, and its version in French
In order to find the French translation of a term, follow the links in the English glossary - you will arrive at an article whose title is the French translation or appears in bold at the beginning of the article. For translation from French to English or Breton, search for the terminologie [terminology] subpart of the articles.
  • A list of technical abbreviations, as well as traditional acronyms for corpora and glosses.


What's new?

This website is constantly evolving, and can be used to share news about the study of Breton, whether internal or external to the website.

  • Here I announce calls for papers, conferences, important publications, news about the Breton language, and the latest work on the website.
  • It is also possible to follow the history of one page in particular, or check the latest modifications on the website under Modifications Récentes.
  • A page of useful external links:
In English, Breton, French, there are links to explore the study of Breton and to electronic resources on Breton, the Celtic languages, and minority languages in general.

This is a participatory website: you can post information on the news page, add new references to the bibliography, or enrich the links pages.


Find out more about a particular dialect

The tools of the Resource Center are meant to be of help.

  • Find corpora or references
The dialects of works in the general bibliography have been geographically localized on a Google map. This lets you visualize the geographical distribution of works on different varieties of Breton.
If a grammar reports a feature specific to a dialect and you want to check the facts, the map helps you find references for the corpora and grammars closest to that place.
This list is far from exhaustive, but it allows you to choose an author or work according to the dialect of interest.
  • Consult the list of different usable corpora in the Resource Center.
It provides references for audio transcripts and for written, glossed, and IPA-transcribed corpora, among others.

Learning

The system of clickable glosses makes ARBRES a powerful tool for learning the language. One can surf from page to page, reading the Breton sentences and clicking on any item of a sentence one wants to learn more about.

One can search for a lexical item in any of the different orthographic systems, or under a mutated form, and still find the page.
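
To give an idea of what finding a page from a mutated form involves, here is a minimal sketch that proposes possible unmutated initials for a word and looks them up among known page titles. The reversal table follows standard descriptions of Breton initial consonant mutations and is supplied here as an assumption; it is not taken from this page. The website itself relies on redirects rather than on this kind of computation, and the function names are illustrative.

```python
# Mutated initial -> possible radical initials (assumed table, see lead-in).
REVERSE_MUTATIONS = {
    "c'h": ["k", "g"],
    "g": ["k"],
    "d": ["t"],
    "b": ["p"],
    "w": ["gw"],
    "z": ["d", "t"],
    "v": ["b", "m"],
    "f": ["p"],
    "k": ["g"],
    "t": ["d"],
    "p": ["b"],
}


def radical_candidates(form):
    """Return the form itself followed by its possible unmutated variants."""
    form = form.lower()
    candidates = [form]  # the form may simply be unmutated
    # Try the longest initials first so that "c'h" is matched as one unit.
    for mutated in sorted(REVERSE_MUTATIONS, key=len, reverse=True):
        if form.startswith(mutated):
            rest = form[len(mutated):]
            candidates += [radical + rest for radical in REVERSE_MUTATIONS[mutated]]
            break
    return candidates


def find_page(form, titles):
    """Return the first candidate that is a known page title, else None."""
    for candidate in radical_candidates(form):
        if candidate in titles:
            return candidate
    return None


print(find_page("c'hrokodil", {"krokodil"}))
```

The sketch only undoes the initial mutation. Inflected or derived forms such as plurals would still need the redirects or a separate lemmatization step.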


Teaching

There are many fun, clever ways to use this website to create pedagogical content. Here are some suggestions:

  • The page "Catégorie:Désambiguïsations" collects a clickable list of ambiguous morphemes, which have more than one meaning in Breton. This could provide good quiz material.
  • Specific maps of the Linguistic Atlas of Low-Brittany have been integrated within articles, making it easy to build cartographic representations for a given topic.
  • Why not propose a collaborative project with this website?


Open Science

Since 2009, the ARBRES website has been an experiment in open and participatory science. It is an open research notebook. I have discussed this experiment in an article: Jouitteau (2013b).


Open Access

Open access means that the results of research are to be made available at the end of the process of research.

The Breton grammar on this website is freely accessible, and collects links toward other works made available on the web. Some articles are available for download directly from this website.


Research in the Making

Research is at heart a process accessible to anyone. Here one can see research in the making, with contributions, corrections, discussions, and new hypotheses on the rise being tested.

  • The traceability of this work is complete. Each page is associated with a complete history recording all modifications made to it since its creation. One can watch live how science is done, and how new data and new publications change our hypotheses.


Science 2.0

Passive Crowdsourcing

This website is equipped with internal statistical tools and with Google Analytics. This allows for a fairly precise statistical picture of how the website is being used. This anonymized data, a flow of about 55 human visits a day, provides a useful form of feedback.


Active Crowdsourcing

The new digital tools allow for both cumulative collaboration (massive aggregation of small contributions into a single architecture), and a distributive collaboration (with differentiated tasks).

This project raises the question of your place in the process. You can help the project at many different levels. Will you take part?


Science for Everyone

The object of science should never be reduced to civic work, because science has its own legitimate internal goals. However, when science can take on this civic dimension while pursuing its own goals, why miss the opportunity? Some examples:

  • The organizers of Redadeg 2014 asked for translations of I speak Breton, and you? into different languages. Within a few days, linguists from all over the world happily contributed to the page I speak Breton, what about you?, bringing together translations of this sentence in 77 different languages. 1,695 Breton speakers posted self-portraits online with these phrases in support of the Redadeg [Race for the language event].
  • Certain societal debates take place in anger due to a lack of verifiable information, a lack of knowledge of real linguistic varieties, or an accumulation of inaccuracies. The site develops linguistic discussion articles which provide concrete elements of analysis on these debates which animate society. The digital format of these articles makes them directly shareable on social networks, in a format open to a scientific discussion.

NLP uses

What type of corpus does ARBRES provide?

The data from the ARBRES wikigrammar were collected and annotated by a linguistics researcher. This is data collected to construct fundamental research in formal linguistics. In this sense, the data are those of a research notebook. The data was then organized and significantly augmented with the aim of creating a descriptive grammar, usable in its online form by the speaking community. The goal is therefore twofold: to produce a description of a natural language in its diversity, complexity and regularities, and to provide new data relevant to fundamental research debates in generative linguistics.

The resulting corpus contains free corpus data, extracted from oral interviews or various cultural products (newspaper articles, novels, songs, poems, collections of popular expressions, political leaflets, town hall presentation sites, posts on social networks, etc.). It contains the somewhat artificial sentences typical of grammars, but they are outweighed by other, more natural ones, of varied informational structure. Copyright on these sources is respected because only a modest percentage of their sentences is cited, in isolation. The sentences are also distributed in a form enriched with grammatical analysis. The corpus also includes elicitation data, the result of fieldwork for linguistic description purposes. The linguist has submitted to native speakers a protocol of questions, translations, image description tasks, or grammaticality judgment tasks on sentences proposed to them. Copyright on these sources is respected because the speakers provide informed consent to the dissemination of the survey results or, where applicable, to the online distribution of their voice.

The presence of elicitation data means the presence of data tagged as ungrammatical. To ensure that a precise fact is the key to the acceptability of a form, formal linguists establish minimal pairs: two forms that differ minimally, the first grammatical and the other ungrammatical. To understand dialectal differences, it is also important to know how far in space a given form will be understood or accepted. Beyond the dialect boundary of a given linguistic fact, speakers report its forms as ungrammatical. We then obtain minimal pairs {dialect A, OK form / dialect B, ungrammatical form}. The minimal pairs entered in the wikigrammar can be brought together to form either training sets or model evaluation sets. In the translation training sets built so far, ungrammatical data have not been used: only the grammatical member of each pair joined the corpus.
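
The two uses of minimal pairs mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the class and function names are hypothetical, the sentences are placeholders rather than real Breton data, and the actual extraction format of the ARBRES data is not described on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    sentence: str
    variety: str
    grammatical: bool


@dataclass(frozen=True)
class MinimalPair:
    accepted: Judgment
    rejected: Judgment

    def __post_init__(self):
        # A pair must oppose one accepted form to one rejected form.
        if not self.accepted.grammatical or self.rejected.grammatical:
            raise ValueError("a pair needs one grammatical and one ungrammatical form")


def training_sentences(pairs):
    """Keep only the grammatical member of each pair, as in the training sets."""
    return [pair.accepted.sentence for pair in pairs]


def evaluation_items(pairs):
    """Return (accepted, rejected) sentence pairs for model evaluation."""
    return [(pair.accepted.sentence, pair.rejected.sentence) for pair in pairs]


# Placeholder data: the strings stand for real sentences.
PAIRS = [
    MinimalPair(
        accepted=Judgment("<form 1>", "dialect A", True),
        rejected=Judgment("<form 2>", "dialect B", False),
    ),
]

print(training_sentences(PAIRS))
print(evaluation_items(PAIRS))
```

For a dialectal pair, the two judgments may carry the same sentence with different varieties, which is why the variety is stored on each judgment rather than on the pair.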

The presence of written corpus data from the 20th century means, in the case of Breton, the presence of several competing spellings. The source data have not been altered, and examples appear in their original printed spellings. However, each form is connected to its equivalent in standard spelling. Each word of the sentences is glossed (translated as if found in isolation). This gloss is clickable for the user of the interface, and its redirect address is the standard spelling of the form. The multiplicity of spellings, combined with the systematic linking of each occurrence to its standard lemma, allows for high diversity in the data without harming their consistency. The same system that redirects tokens to their respective lemmas also makes it possible to connect the various word forms. This is key in this Celtic language, in which words not only inflect by suffixation but also undergo modifications of their initial consonant depending on the syntactic context in which they appear (consonant mutations). The lemma krokodil can thus be automatically linked to its occurrences in krokodil Maia 'the crocodile of Maia', ar c'hrokodil 'the crocodile', ar c'hrokodiled 'the crocodiles', war grokodileta 'about to look for crocodiles'. In the wiki, all these occurrences point to the same page dedicated to the lemma krokodil. Since this page is categorized as a page concerning a noun, its grammatical category is also automatically recoverable. For a detailed description of the recoverable grammatical annotations, see Jouitteau & Bideault (2023) and the details of the data extraction project by AUTOGRAMM Breton treebank II.
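
The linking described in this paragraph can be pictured as two small tables: one from each printed form to its lemma page, and one from each lemma page to its category. The sketch below uses the krokodil forms quoted above; the data structures and function names are illustrative assumptions, not the extraction pipeline described in Jouitteau & Bideault (2023).

```python
from collections import defaultdict

# (form as printed, lemma page it links to), from the krokodil example above.
LINKS = [
    ("krokodil", "krokodil"),
    ("c'hrokodil", "krokodil"),
    ("c'hrokodiled", "krokodil"),
    ("grokodileta", "krokodil"),
]

# Category of each lemma page; the text states that krokodil is a noun page.
PAGE_CATEGORY = {"krokodil": "noun"}


def forms_by_lemma(links):
    """Group every attested form under the lemma page it points to."""
    index = defaultdict(set)
    for form, lemma in links:
        index[lemma].add(form)
    return dict(index)


def annotate(form, links, categories):
    """Return (lemma, category of the lemma page) for a form, or None."""
    lemma = dict(links).get(form)
    if lemma is None:
        return None
    return lemma, categories.get(lemma)


print(forms_by_lemma(LINKS)["krokodil"])
print(annotate("c'hrokodiled", LINKS, PAGE_CATEGORY))
```

Note that the category returned is that of the lemma page, which is the information the wiki makes recoverable; it says nothing further about the inflected or derived form itself.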

ARBRES is a grammar of dialects, and its corpus has a high dialectal diversity by design. It is a descriptive grammar, not a prescriptive one: Standard Breton is treated as one dialect among the others. The dialectal spectrum is therefore quite broad. The Gwenedeg dialect, which is also linguistically the furthest from the others, is under-represented: its analysis requires expertise that the main editor sometimes lacks, and as a result fewer data represent this dialect. Aside from this particular deficit, rare dialect facts can be considered quantitatively over-represented in the data. Linguistic facts that are widespread in the language are illustrated once for each major dialect, but not beyond. By contrast, to describe a rare fact precisely, along with its dialectal distribution and the parameters of its context of appearance, examples are provided for each attested occurrence. Rare facts are also more likely to be the subject of thematic elicitation research, which provides more data where they occur. For the same purpose of describing variation, forms of different styles co-exist within the corpus, with a quantitative over-representation of this variation compared to any single corpus. In this sense, the ARBRES corpus is not well suited to quantitative studies, but it offers a concentrate of grammatical diversity for automatic training.

This way of gathering data is expensive in that it requires one or more people trained in the language with at least minimal dialectal flexibility, sufficient social reach to contact speakers of different linguistic profiles, and ways for those speakers to find a non-monetary benefit in taking part in a linguistic protocol. The work also involves a long time spent coding the examples and presenting them adequately in the grammar for a human readership. It requires technical support for the design and general maintenance of the site and its updates, and technical monitoring of its on-screen accessibility for various users. However, all of these necessary resources exist outside the scope of NLP. At the community level, investment may be driven entirely by internal goals. The database incrementally builds an educational and/or scientific resource in a form adapted to its audience. On the scale of small language communities, this avoids monopolizing experts to create databases that would not be usable by the general public. The development of wikigrammars is particularly recommended for building pilot-project resources on languages with restricted corpora because, if the IT field fails to provide finalized tools for speakers, the investment will remain beneficial for the speaking community, which can continue to improve it for itself. In terms of human resources, descriptive and formal linguists set themselves the task of producing language analysis material. They are generally few in number for languages with restricted corpora, but they are often strongly committed to their empirical domain and to the speakers who produce it, with detailed cultural knowledge of how to interact with them. The wiki solution, for its part, is directly designed for large-scale collaboration among potentially isolated contributors, which is particularly suitable for minoritized languages.

Bibliography

  • Jouitteau, Mélanie & Reun Bideault. 2023. 'Outils numériques et traitement automatique du breton', Annie Rialland, Michela Russo (dir.), Langues régionales de France: nouvelles approches, nouvelles méthodologies, revitalisation, Éditions de la Société de Linguistique de Paris, 37-74.