THE EXTENDING OF THE ELECTRONIC CORPUS OF THE 17TH- AND 18TH-CENTURY POLISH TEXTS AND ITS INTEGRATION WITH THE ELECTRONIC DICTIONARY OF THE 17TH- AND 18TH-CENTURY POLISH

Principal investigator: Prof. Włodzimierz Gruszczyński

Contractors

Employees of the Institute of Polish Language, PAS: Dorota Adamiec, Bartłomiej Borek, Renata Bronikowska, Mirella Gliwińska, Katarzyna Kryńska, Magdalena Majdak, Jagoda Marszałek, Wiesław Morawski, Ewa Rodek, Aleksandra Wieczorek

Employees of the Institute of Computer Science, PAS: Tomasz Bartosiak, Witold Kieraś, Dorota Komosińska, Bartłomiej Nitoń, Maciej Ogrodniczuk, Marcin Woliński, Alina Wróblewska

Other contractors: Magdalena Awianowicz, Halina Bedeniczuk, Joanna Bilińska-Brynk, Alina Borsewicz, Marta Chomaniuk, Anna Dzierżawska, Zbigniew Gawłowicz, Michał Godlewski, Norbert Gołdys, Artur Goszczyński, Bożena Itoya, Klaudia Jovanovska, Hanna Jurczyk, Ilona Jurkiewicz-Buchała, Ewa Karasińska-Gajo, Kacper Kardas, Agnieszka Kirsztejn, Ludwika Klejnowska, Joanna Koc, Magdalena Kołodziejczyk, Bartosz Kossakowski, Matylda Kozłowska, Grzegorz Kulesza, Weronika Lachowicz, Małgorzata Maciejewska, Agnieszka Małochleb, Emanuel Modrzejewski, Aleksandra Opalińska, Ewa Oranowska-Wróbel, Natalia Owsianka, Małgorzata Pachulska, Katarzyna Płońska, Paulina Rosalska, Martyna Sabała-Bolek, Andrea Smolarz, Katarzyna Stankiewicz, Olga Stolarczyk, Jacek Stwora, Monika Szafrańska, Agnieszka Szulińska, Bartosz Szymański, Renata Śliż, Klaudia Wieczorek, Michał Wieczorek, Patrycja Wojtasik, Krzysztof Wróbel, Maciej Zboch, Mateusz Żółtak

Project number: 11H 180413 86
Start date: 2018-12-06
End date: 2023-12-05
Funding entity: MEiN – NPRH 86

PROJECT DESCRIPTION

The aim of the project was to continue work on the Electronic Corpus of the 17th- and 18th-century Polish Texts (until 1772), created in 2013-2018. The corpus, initially numbering 13.5 million segments and including texts from the Baroque period (hence the short name KorBa for “Baroque Corpus”), was expanded to include texts from the late 18th century belonging to the Enlightenment period. Since both of these cultural currents have left a clear mark on the language, the new version of KorBa comprises two subcorpora that can be searched separately: Baroque (1601-1740) and Enlightenment (1741-1800). New texts from the 17th and early 18th centuries have also been added to the corpus, selected to ensure greater chronological, geographical, genre and thematic balance. In total, the new KorBa contains nearly 27 million tokens from more than 2,000 texts. An experimental syntactically annotated corpus consisting of 1,000 sentences has also been developed.

The new version of KorBa has been built using two new tools based on neural networks: a transcriber, designed to automatically convert transliterated text into modern spelling, and the KFTT tagging system, comprising a tokenizer, a morphosyntactic tagger and a lemmatizer. Thanks to its neural architecture, KFTT can handle the less common tokenization and spelling found in historical texts with high accuracy. The use of modern technologies made it possible to reduce the number of errors occurring during data processing and thus to increase the reliability of the results.

The project also included the integration of four sources for research on the Polish language of the 17th and 18th centuries: KorBa, the Electronic Dictionary of the 17th- and 18th-century Polish (e-SXVII), the Digital Library of Polish and Poland-Related News Pamphlets from the 16th to the 18th Century (CBDU) and the Card-index of the Dictionary of the Polish Language of the 17th and First Half of the 18th Century (KXVII). For this purpose, a dedicated website, Polish Language of the 17th and 18th Centuries (https://polszczyzna17-18.ijp.pan.pl), has been developed, which allows these resources to be searched simultaneously. In addition, connections serving specific purposes have been created between individual resources. For instance, the connections between the KorBa and e-SXVII websites make it easier for dictionary editors to use the corpus. The links between CBDU and e-SXVII make it possible to explain archaic words appearing in CBDU texts by referring to the appropriate e-SXVII entries. All these connections are dynamic, which means that the data is retrieved each time from the current database of the resource concerned.

LIST OF PUBLICATIONS RELATED TO THE PROJECT

Bilińska-Brynk, J., Rodek, E., Paper Quotation Slips to the Electronic Dictionary of the 17th- and 18th-Century Polish – Digital Index and its Integration with the Dictionary, [in:] Gavriilidou, Z., Mitsiaki, M., Fliatouras, A. (eds.) Proceedings of the XIX EURALEX Congress: Lexicography for Inclusion, vol. I, Democritus University of Thrace (2020), pp. 465-470.

Bronikowska, R., Kryńska, K., Łacina w KorBie. Użyteczność Elektronicznego Korpusu Tekstów Polskich XVII i XVIII Wieku dla filologa neolatynisty, “Polonica” XL (2020), pp. 123-135.

Bronikowska, R., Majdak, M., Wieczorek, A., Żółtak, M., The Electronic Dictionary of the 17th- and 18th-century Polish – towards the open formula asset of the historical vocabulary, [in:] Gavriilidou, Z., Mitsiaki, M., Fliatouras, A. (eds.) Proceedings of the XIX EURALEX Congress: Lexicography for Inclusion, vol. I, Democritus University of Thrace (2020), pp. 471-475.

Bronikowska, R., Predykatywne użycia przymiotników w rodzaju żeńskim w dawnej polszczyźnie – semantyczna charakterystyka na podstawie danych korpusowych, “Prace Filologiczne” 76 (2021), pp. 49-65. https://doi.org/10.32798/pf.869

Bronikowska, R., Unfinished “verbization” process – the development of predicative constructions with an adjective of the feminine gender in the 17th and 18th centuries in the light of corpus data, “Polonica” 41(1) (2021), pp. 97-110. https://doi.org/10.17651/POLON.41.7

Bronikowska, R., Verbification of feminine forms of adjectives można ‘possible’, niemożna ‘impossible’ and niepodobna ‘impossible’ – corpus-based approach, “Jazykovedný Časopis”, vol. 74(1) (2023), pp. 9-18. https://www.juls.savba.sk/ediela/jc/2023/1/jc23-01.pdf

Gruszczyński, W., Adamiec, D., Bronikowska, R., Kieraś, W., Modrzejewski, E., Wieczorek, A. and Woliński, M., The Electronic Corpus of 17th- and 18th-century Polish Texts, “Language Resources and Evaluation”, vol. 56, issue 1 (2021), pp. 309-332. https://link.springer.com/article/10.1007%2Fs10579-021-09549-1

Gruszczyński, W., Adamiec, D., Bronikowska, R., Wieczorek, A., Elektroniczny Korpus Tekstów Polskich z XVII i XVIII w. – problemy teoretyczne i warsztatowe, “Poradnik Językowy” 8 (2020), pp. 32-51.

Gruszczyński, W., Adamiec, D., Majdak, M., Barokowa polszczyzna w internecie, czyli Elektroniczny słownik języka polskiego XVII i XVIII wieku, “LingVaria” 1 (2023), pp. 113-124. https://doi.org/10.12797/LV.18.2023.35.08

Majdak, M., Keywords in religious literature of 17th and 18th centuries in light of the data from the Electronic Corpus of 17th- and 18th-century Polish Texts, “Jazykovedný Časopis”, vol. 74(1) (2023), pp. 100-107. https://www.juls.savba.sk/ediela/jc/2023/1/jc23-01.pdf

Majdak, M., Znaczenia wyrazu głos w słownikach i tekstach, [in:] M. Majdak, “Głos. Studium leksykograficzne”, series Prace Instytutu Języka Polskiego PAN 153, Kraków 2019, pp. 50-148.

Ogrodniczuk, M., Gruszczyński, W., Connecting Data for Digital Libraries: The Library, the Dictionary and the Corpus, [in:] Jatowt A., Maeda A., Syn S. (eds.) Digital Libraries at the Crossroads of Digital Information for the Future. ICADL 2019. Lecture Notes in Computer Science, vol. 11853. Springer, Cham (2019), pp. 125-138.

Ogrodniczuk, M., Gruszczyński, W., Wikipedia-Based Entity Linking for the Digital Library of Polish and Poland-Related News Pamphlets, [in:] Ishita E., Pang N.L.S., Zhou L. (eds.) Digital Libraries at Times of Massive Societal Transition. ICADL 2020. Lecture Notes in Computer Science, vol. 12504. Springer, Cham (2020), pp. 81-88.

Ogrodniczuk, M., Kryńska, K., Evaluating Machine Translation of Latin Interjections in the Digital Library of Polish and Poland-related News Pamphlets, [in:] Tseng, YH., Katsurai, M., Nguyen, H.N. (eds.) From Born-Physical to Born-Virtual: Augmenting Intelligence in Digital Libraries. ICADL 2022. Lecture Notes in Computer Science, vol. 13636. Springer, Cham. https://doi.org/10.1007/978-3-031-21756-2_34

Rodek, E., Rzeczowniki żeńskoosobowe zakończone na -yni/-ini, -ica, -iczka, -aczka, -anka, -arka w XVII i XVIII wieku (na materiale z Elektronicznego Korpusu Tekstów Polskich z XVII i XVIII w.), “Prace Filologiczne” 78 (2023), pp. 337-358. https://wuw.pl/data/include/cms//Prace_Filologiczne_2023_78.pdf

Wieczorek, A., Integracja Elektronicznego słownika języka polskiego XVII i XVIII wieku i Elektronicznego Korpusu Tekstów Polskich z XVII i XVIII Wieku okiem użytkownika i redaktora, [in:] E. Horyń, E. Młynarczyk and P. Żmigrodzki (eds.) Język polski – między tradycją a współczesnością. Księga jubileuszowa z okazji stulecia Towarzystwa Miłośników Języka Polskiego, Kraków 2021, pp. 547-560.

USEFUL WEBSITES

https://korba.edu.pl
https://sxvii.pl
https://www.rcin.org.pl/dlibra/publication/20029
https://cbdu.ijp.pan.pl
https://polszczyzna17-18.ijp.pan.pl

 
