About of the metric homogeneity of texts in Slavic languages

Usmanov, Z. D.; Kosimov, A. A.

Please use this identifier to cite or link to this item: https://libeldoc.bsuir.by/handle/123456789/45451

Title:	About of the metric homogeneity of texts in Slavic languages
Other Titles:	К вопросу о метрической однородности текстов на славянских языках
Authors:	Usmanov, Z. D.; Kosimov, A. A.
Keywords:	материалы конференций;texts;languages;alphabet;тексты;статистическое моделирование;родственные слова
Issue Date:	2021
Publisher:	БГУИР
Citation:	Usmanov, Z. D. About of the metric homogeneity of texts in Slavic languages / Z. D. Usmanov, A. A. Kosimov // Открытые семантические технологии проектирования интеллектуальных систем = Open Semantic Technologies for Intelligent Systems (OSTIS-2021) : сборник научных трудов / Белорусский государственный университет информатики и радиоэлектроники ; редкол.: В. В. Голенков [и др.]. – Минск, 2021. – Вып. 5. – С. 313–316.
Abstract:	R. Gray and K. Atkinson (statistical analysis of related words), W. Chang, C. Cathcart, D. Hall and A. Garrett (statistical modeling), and A. S. Kasyan and A. V. Dybo (lexicostatistical classification) present, alongside their discussion of historical issues, genealogical trees that reflect both the kinship and the divergence of modern Slavic languages. There are many such trees; they agree in general terms and differ in small details. The area of the formerly common language is now divided into three groups: the eastern group, consisting of Belarusian, Russian and Ukrainian; the western group, consisting of Czech, Slovak, Polish, Kashubian and the Lusatian languages; and the southern group, consisting of Bulgarian, Macedonian, Serbo-Croatian and Slovenian. Using a randomly assembled model collection of 26 texts in 13 languages (2 works per language), the article establishes that the γ-classifier can automatically recognize which group of Slavic languages a text belongs to, based on the frequencies of a set of Latin characters that is universal for all the languages. The mathematical model of the γ-classifier is a triad: a digital portrait (DP) of the text, that is, the frequency distribution of Latin character unigrams in the text; a formula for calculating the distance between the DPs of two texts; and a machine learning algorithm that implements the hypothesis of “homogeneity” of works from one language group and “heterogeneity” of works from different language groups. The algorithm was tuned on a table of pairwise distances between all works of the model collection by selecting the optimal value of the real parameter γ, the value that minimizes the number of violations of the “homogeneity” hypothesis.
The γ-classifier trained on the texts of the model collection showed 86% accuracy in recognizing the language groups of the works. To test the classifier, 3 additional random texts were selected, one for each of the three groups of Slavic languages. Under the nearest-neighbor method (nearest in terms of distance), all new texts confirmed their homogeneity with the corresponding pairs of same-language works, and thereby with the corresponding group of Slavic languages.
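The triad described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not give the distance formula, the character set, or the exact role of γ, so the sketch assumes the 26 basic Latin letters, an L1 distance between frequency vectors, and γ acting as a threshold below which two texts count as "homogeneous". It also assumes texts are already in Latin script. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
import string

# Assumption: the "universal" character set is the 26 basic Latin letters.
ALPHABET = string.ascii_lowercase


def digital_portrait(text):
    """Relative frequency of each Latin letter (unigram) in the text."""
    counts = Counter(ch for ch in text.lower() if ch in ALPHABET)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(ALPHABET)
    return [counts[ch] / total for ch in ALPHABET]


def distance(p, q):
    """Assumed metric: L1 distance between two digital portraits."""
    return sum(abs(a - b) for a, b in zip(p, q))


def count_errors(portraits, groups, gamma):
    """Count pairs that violate the homogeneity hypothesis for a given gamma.

    A pair is an error if it is within gamma but from different groups,
    or farther than gamma but from the same group.
    """
    errors = 0
    for i, j in combinations(range(len(portraits)), 2):
        close = distance(portraits[i], portraits[j]) <= gamma
        same_group = groups[i] == groups[j]
        if close != same_group:
            errors += 1
    return errors


def tune_gamma(portraits, groups):
    """Pick the candidate gamma (an observed pairwise distance) with fewest errors."""
    candidates = sorted(
        {distance(portraits[i], portraits[j])
         for i, j in combinations(range(len(portraits)), 2)}
    )
    return min(candidates, key=lambda g: count_errors(portraits, groups, g))


def nearest_group(portrait, portraits, groups):
    """Nearest-neighbor rule: return the group of the closest known text."""
    best = min(range(len(portraits)),
               key=lambda i: distance(portrait, portraits[i]))
    return groups[best]
```

In this sketch, `tune_gamma` only tries the pairwise distances observed in the collection, because the error count can change only at those values. `nearest_group` corresponds to the test step, where each new text is assigned the group of its closest neighbor in the model collection.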
Alternative abstract:	В исследованиях Р. Грея и К. Аткинсона посредством статистического анализа родственных слов, У. Чанга, Ч. Кэткарта, Д. Холла и А. Гарретта с помощью статистического моделирования и А. С. Касьяна и А. В. Дыбо на основе лексикостатистической классификации помимо обсуждения исторических вопросов представлены генеалогические деревья, отражающие как родство, так и дивергенцию современных славянских языков. Таких деревьев достаточно много, они сходны в общих чертах и различны в небольших деталях, см. например. Ареал прежде единого языка ныне разделился на три группы – восточную в составе белорусского, русского и украинского языков, западную - из чешского, словацкого, польского, кашубского и лужицких языков и южную, состоящую из болгарского, македонского, сербо-хорватского и словенского языков. В статье на примере случайно сформированной модельной коллекции из 26 текстов на 13 языках (по 2 произведения от каждого языка) устанавливается применимость γ-классификатора для автоматического распознавания принадлежности текстов той или иной группе славянских языков на основе частотности универсального для всех языков набора латинских символов. Математическая модель γ-классификатора представляется в виде триады, составленной из цифрового портрета (ЦП) текста - распределения в тексте частотности латинских символьных униграмм; формулы для вычисления расстояний между ЦП текстов и алгоритма машинного обучения, реализующего гипотезу “однородности” произведений из одной группы языков и “неоднородности” произведений, принадлежащих разным группам языков. Настройка алгоритма, использующего таблицу парных расстояний между всеми произведениями модельной коллекции, осуществлялась путем подбора оптимального значения вещественного параметра γ, минимизирующего число ошибок нарушения гипотезы “однородности”. Обученный на текстах модельной коллекции γ-классификатор показал 86%-ю точность в распознавании языков произведений.
Для тестирования классификатора были выбраны 3 дополнительных случайных текста, по одному тексту для трёх разных групп славянских языков. Методом ближайшего (по расстоянию) соседа все новые тексты подтвердили свою однородность с соответствующими парами одноязычных произведений, тем самым и однородность с соответствующей группой славянских языков.
URI:	https://libeldoc.bsuir.by/handle/123456789/45451
Appears in Collections:	OSTIS-2021

Files in This Item:

File	Description	Size	Format
Usmanov_About.pdf		150.42 kB	Adobe PDF
