Frequency Distribution of Letters, Bigrams and Trigrams in the Macedonian language

  • Aleksandra Mileva
  • Stojanče Panov
  • Vesna Dimitrova

Abstract

Frequency analysis in cryptanalysis is based on the fact that, in any given piece of written text, certain letters and combinations of two or three letters occur with varying frequencies. In this paper we present the average frequency distribution of letters, bigrams and trigrams in the Macedonian language. The frequencies of the most common first and last letters of words are also given. Our results are based on approximately 15,000 pages of written text covering the following subjects: poetry, prose, drama, natural sciences, social sciences, law, various laws, economy, and computer science. The resulting letter frequency sequence is “А О И Е Т Н Р С В Д К Л П М У З Ј Г Б Ч Ш Ц Ж Њ Ф Ќ Х Ѓ Џ Љ Ѕ”, the most common bigrams are “НА АТ ТА НИ ТЕ РА ОТ СТ ТО КО” and the most common trigrams are “ИТЕ АТА УВА ИЈА АЊЕ СТА ОСТ ВАЊ ПРО ПРЕ”.
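As an illustration of the kind of counting such a study involves, the following Python sketch tallies letters, bigrams and trigrams, together with first and last letters of words, in a piece of Macedonian text. It is not the authors' procedure: the function names (`words`, `ngram_counts`, `edge_letter_counts`), the choice to count n-grams only inside words, and the handling of case and non-letter characters are all assumptions made for this example.

```python
from collections import Counter
import unicodedata

# The 31 letters of the Macedonian Cyrillic alphabet (uppercase).
MK_ALPHABET = set("АБВГДЃЕЖЗЅИЈКЛЉМНЊОПРСТЌУФХЦЧЏШ")


def words(text):
    """Return uppercase words made only of Macedonian letters.

    Assumption: every non-Macedonian character (digits, punctuation,
    Latin letters) is treated as a word separator.
    """
    # NFC composes sequences such as Г + combining acute into Ѓ.
    text = unicodedata.normalize("NFC", text).upper()
    cleaned = "".join(ch if ch in MK_ALPHABET else " " for ch in text)
    return cleaned.split()


def ngram_counts(text, n):
    """Count n-grams inside words (assumption: none span a word boundary)."""
    counts = Counter()
    for w in words(text):
        for i in range(len(w) - n + 1):
            counts[w[i:i + n]] += 1
    return counts


def edge_letter_counts(text):
    """Count how often each letter starts and ends a word."""
    ws = words(text)
    first = Counter(w[0] for w in ws)
    last = Counter(w[-1] for w in ws)
    return first, last


if __name__ == "__main__":
    sample = "На масата"  # hypothetical sample text, not from the corpus
    for n in (1, 2, 3):
        print(n, ngram_counts(sample, n).most_common(5))
    print(edge_letter_counts(sample))
```

Dividing each count by the total number of n-grams gives relative frequencies, and sorting by count on a large enough corpus would yield rankings comparable in form to the sequences quoted in the abstract.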

Published
2013-04-01