Corpas na Gaeilge 1882-1926

The corpus is a collection of printed Irish-language texts published between 1882 and 1926. It currently contains 281 texts. Most are books representing a range of genres, including short stories, novels, plays, biographies, poetry, song lyrics, historical and religious studies, scientific reference books, government documents and translations. The corpus currently runs to 7.1 million words, from which the entries for the Historical Irish Dictionary will be chosen.

Guidelines & Instructions.

The texts in this corpus are reproduced as they were published in the edition used (where possible, the final edition published during the author's lifetime). A number of printing errors are therefore visible in the texts.

The material in the corpus has not yet been lemmatised. For now, users must search for a term under each of its spellings and grammatical forms.

For example, when researching instances of the word “crann” (tree), a separate search should be made for each form: crann, crainn, cranna, gcrann, chrann, crannaibh, chrannaibh, gcrannaibh, etc.

Similarly, when researching the usage of a verb, a separate search should be made for every known form. For the verb “téigh” (to go), for example, search separately for téigh, chuaigh, chuaidh, chuadar, chuamar, chuathas, téann, rachaidh, rachaimid, etc.

Once the lemmatisation process, which is currently being tested, is complete, all of these forms will be brought together so that a single search displays every related form.
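For a user who has a plain-text copy of a passage to hand, the multi-form search described above can be approximated in one pass with a short script. The following is a minimal Python sketch under stated assumptions: the function name find_forms, the list name CRANN_FORMS and the sample sentence are hypothetical illustrations, not part of the corpus search interface and not taken from any text in the corpus.

```python
import re

def find_forms(text, forms):
    """Return (matched form, start offset) for every whole-word match
    of any of the listed spellings, ignoring case."""
    alternatives = "|".join(re.escape(form) for form in forms)
    # (?<!\w) and (?!\w) stop a form from matching inside a longer word,
    # so "crann" does not match inside "crannaibh".
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)
    return [(m.group(0), m.start()) for m in pattern.finditer(text)]

# The forms listed in the guidelines for "crann" (tree).
CRANN_FORMS = ["crann", "crainn", "cranna", "gcrann", "chrann",
               "crannaibh", "chrannaibh", "gcrannaibh"]

# Hypothetical sample sentence, made up for illustration only.
sample = "Bhí crann mór ann. Shuigh sé faoin gcrann. Bhí éin ar na crannaibh."

for form, offset in find_forms(sample, CRANN_FORMS):
    print(offset, form)
```

The list of forms still has to be drawn up by hand, as in the guidelines; the sketch only saves running each search separately.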

The corpus can be searched by title, by publication date, by author, or by any combination of these criteria. Highlight the required choices in the three boxes displayed (hold down the ‘Ctrl’ key to select multiple choices), then choose the appropriate combination of the ‘AND’ and ‘OR’ buttons. It is not essential to highlight a choice in every box.
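The way such criteria combine can be pictured with a small Python sketch. It rests on assumptions that the guidelines do not confirm: that several choices within one box count as alternatives, that a box with nothing highlighted is ignored, and that a single ‘AND’ or ‘OR’ setting joins the boxes. The names Text, select_texts and catalogue, and the placeholder titles and authors, are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Text:
    title: str
    author: str
    year: int

def select_texts(texts, titles=(), authors=(), years=(), combine="AND"):
    """Keep the texts that satisfy the chosen criteria.
    An empty criterion is skipped, mirroring a box left unhighlighted."""
    checks = []
    if titles:
        checks.append(lambda t: t.title in titles)
    if authors:
        checks.append(lambda t: t.author in authors)
    if years:
        checks.append(lambda t: t.year in years)
    if not checks:
        return list(texts)  # nothing chosen: every text qualifies
    join = all if combine == "AND" else any
    return [t for t in texts if join(check(t) for check in checks)]

# Placeholder catalogue, not real corpus data.
catalogue = [
    Text("Text A", "Author X", 1904),
    Text("Text B", "Author X", 1917),
    Text("Text C", "Author Y", 1910),
]

narrow = select_texts(catalogue, authors={"Author X"}, years={1904, 1910})
broad = select_texts(catalogue, authors={"Author X"}, years={1904, 1910}, combine="OR")
print([t.title for t in narrow])
print([t.title for t in broad])
```

Under these assumptions ‘AND’ narrows the selection to texts meeting every chosen criterion, while ‘OR’ widens it to texts meeting at least one.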

Occurrences of the search term (either a single word or a string) are then displayed under the titles in which they appear, along with the page number and line number, to help locate the term in the original hard copy of the text. The context in which the term is used (5 lines of the original text) is also displayed.
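A display of this kind can be sketched in Python as follows. This is an illustration only, not the site's implementation: it assumes the text is held as a list of (page number, line number, line text) entries, that the term is matched as a plain substring regardless of case, and that the 5 lines of context are centred on the matching line. The function name concordance and the placeholder page are hypothetical.

```python
def concordance(lines, term, context=5):
    """lines: list of (page, line_no, text) tuples in reading order.
    Return one (page, line_no, context_lines) tuple per matching line."""
    before = (context - 1) // 2      # lines shown above the hit
    after = context - 1 - before     # lines shown below the hit
    hits = []
    for i, (page, line_no, text) in enumerate(lines):
        if term.lower() in text.lower():
            window = lines[max(0, i - before): i + after + 1]
            hits.append((page, line_no, [t for _, _, t in window]))
    return hits

# Placeholder page 7 with seven lines; line 4 contains the search term.
page = [(7, n, f"téacs {n}") for n in range(1, 8)]
page[3] = (7, 4, "faoin gcrann mór")

for page_no, line_no, lines_shown in concordance(page, "crann"):
    print(f"p. {page_no}, l. {line_no}")
    for shown in lines_shown:
        print("   ", shown)
```

Because the match is a substring match, a search for "crann" in this sketch also finds "gcrann"; the window is shorter than 5 lines when a hit falls near the start or end of the text.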

Make sure to press the RESET FORM button before beginning another search.

Future Plans.

Tests are in progress for the lemmatisation of this corpus. This will be done automatically as far as possible, with the assistance of programs developed by Kevin Scannell of Saint Louis University and Elaine Uí Dhonnchadha of Trinity College Dublin.

Additional Texts.

Additional texts will be added to this corpus as they are made available, including further books, articles from periodicals, and unpublished manuscripts.

Appreciation.

FNG thanks every organisation and every individual who helped to develop this project in any way, especially Niall O’Leary of the Digital Humanities Observatory.

Corrections sent by email to fng@ria.ie will be gratefully received.