Afrikaanse Lettergreepverdelingspatrone

Suid-Afrikaanse Tydskrif vir Natuurwetenskap en Tegnologie/South African Journal of Science and Technology

 
 
Field Value
 
Title Afrikaanse Lettergreepverdelingspatrone / Afrikaans Syllabification Patterns
 
Creator Fick, Tilla; Swanepoel, Chris J.
 
Subject Syllabification; irregularity; machine-learning techniques; artificial neural networks; decision trees; pattern recognition; training.
Description For Afrikaans, automatic computer-based hyphenation is a problem that requires attention, since errors still occur frequently in printed text. As a first step in this task, it is necessary to divide words into syllables automatically. Because new words are continually created by writing words together as compounds, existing techniques developed for English do not work well for Afrikaans. An “intelligent” technique for syllabification therefore needs to be developed. As a first approach we consider only the orthographic information of words, without taking syntax or morphology into account. This allows us to consider machine-learning techniques such as artificial neural networks and decision trees, which are known for their pattern recognition ability, for the task. These techniques are trained with isolated training pairs consisting of input patterns and corresponding outputs (or targets). In this article we provide the motivation for the study and discuss the process followed to generate data for training machine-learning techniques. We also discuss problem areas such as irregular hyphenation and provide an analysis of letter combinations (or letter patterns) in words with and without syllabification.
In contrast to English, automatic hyphenation by computer of Afrikaans words is a problem that still needs to be addressed, since errors are still often encountered in printed text. An initial step in this task is the ability to syllabify words automatically. Since new words are created continuously by joining words, it is necessary to develop an “intelligent” technique for syllabification. As a first phase of the research, we consider only the orthographic information of words, and disregard both syntactic and morphological information. This approach allows us to use machine-learning techniques such as artificial neural networks and decision trees, which are known for their pattern recognition abilities.
Both these techniques are trained with isolated pairs, each consisting of an input pattern and a corresponding output (or target) that indicates whether or not the input pattern should be split at a certain position. In the process of compiling a list of syllabified words from which to generate training data for the syllabification problem, irregular patterns were identified. The same letter patterns are split differently in different words, and complete words that are spelled identically are split differently depending on their meaning. We also identified irregularities within and between the different dictionaries that we used. We examined the influence range of letters that are involved in irregularities. For example, for the r in agter-ente and vaste-rente we have to consider three letters to the left of r to be certain where the hyphen should be inserted. The influence range of the k in verstek-waarde and kleinste-kwadrate is four letters to the left and three to the right. In an analysis of letter patterns in Afrikaans words we found that the letter e has the highest frequency overall (16.2% of all letters in the word list). Among word-initial letters s is the most frequent, while among word-final letters e is the most frequent. It is important to note that more words end with s than start with s. The two-letter and three-letter patterns that occur most often are er (10% of all two-letter patterns) and ing (4% of all three-letter patterns). In an analysis of syllables in Afrikaans words, we found that (as for complete words) syllables most often start with the letter s and end with e, while the frequency of syllables ending with s is almost as high as the frequency of syllables starting with s. This indicates that problems with hyphenation can be expected around the letter s.
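The influence-range examples above can be checked with a short sketch. It is not the authors' procedure: it assumes one plausible reading of "influence range", namely the distance from the focus letter to the first letter that differs between two words with conflicting breaks. The function names are hypothetical, and each word carries only the single break shown in the abstract.

```python
from typing import Optional, Tuple


def focus_index(hyphenated: str, letter: str) -> int:
    """Index, in the unhyphenated word, of `letter` adjacent to the break."""
    cut = hyphenated.index("-")            # number of letters before the break
    plain = hyphenated.replace("-", "")
    if cut > 0 and plain[cut - 1] == letter:
        return cut - 1                     # letter closes the first part
    if cut < len(plain) and plain[cut] == letter:
        return cut                         # letter opens the second part
    raise ValueError(f"{letter!r} is not adjacent to the break in {hyphenated!r}")


def first_difference(a: str, b: str) -> Optional[int]:
    """1-based position at which a and b first differ, or None if identical."""
    for distance, (x, y) in enumerate(zip(a, b), start=1):
        if x != y:
            return distance
    if len(a) == len(b):
        return None
    return min(len(a), len(b)) + 1         # one string is a prefix of the other


def influence_range(word_a: str, word_b: str, letter: str) -> Tuple[Optional[int], Optional[int]]:
    """Letters needed to the left and right of `letter` to tell the words apart."""
    plain_a, plain_b = word_a.replace("-", ""), word_b.replace("-", "")
    ia, ib = focus_index(word_a, letter), focus_index(word_b, letter)
    # Left context is read outward from the focus letter, hence the reversal.
    left = first_difference(plain_a[:ia][::-1], plain_b[:ib][::-1])
    right = first_difference(plain_a[ia + 1:], plain_b[ib + 1:])
    return left, right


print(influence_range("agter-ente", "vaste-rente", "r"))          # (3, None)
print(influence_range("verstek-waarde", "kleinste-kwadrate", "k"))  # (4, 3)
```

Under this reading, the sketch reproduces the figures quoted above: three letters to the left for r (the right context is identical in both words), and four to the left and three to the right for k.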
The two-letter and three-letter syllables that occur most often are -ge- and -ver-, respectively. In an attempt to decide on the window length to use to generate training data for machine-learning techniques, we also analysed the length of syllables. The results show that two-letter and three-letter syllables occur most often, but that four-letter syllables have the most unique instances. We also analysed a spectrum of window configurations and found that the ideal configuration will have to be determined empirically. One major problem we identified in this study is that irregular syllabification often occurs where letter patterns include the letter s. The reasons are (i) the use of the combining s when joining words, (ii) the almost equal frequencies of syllables starting and ending with s, and (iii) vague hyphenation rules for letter combinations containing s. To address automatic syllabification in Afrikaans effectively, it is necessary to develop more sophisticated methods to handle vagueness around the letter s.
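The windowed training data described above might be generated roughly as follows. This is a minimal sketch, not the authors' implementation: the function name, the padding symbol and the window sizes are assumptions. The abstract itself notes that the ideal window configuration has to be found empirically.

```python
from typing import List, Tuple

PAD = "_"  # hypothetical filler for positions beyond the word boundary


def training_pairs(hyphenated: str, left: int = 3, right: int = 3) -> List[Tuple[str, int]]:
    """Return one (window, target) pair for every gap between two letters.

    The window holds `left` letters before the gap and `right` letters after
    it; the target is 1 if the word is split at that gap and 0 otherwise.
    """
    parts = hyphenated.split("-")
    plain = "".join(parts)

    # Collect the gap positions (counted in letters) where a break occurs.
    breaks = set()
    position = 0
    for part in parts[:-1]:
        position += len(part)
        breaks.add(position)

    # Pad both ends so that every gap has a full-width window.
    padded = PAD * left + plain + PAD * right
    pairs = []
    for gap in range(1, len(plain)):
        # plain[gap - left : gap + right] shifted by the left padding
        window = padded[gap:gap + left + right]
        pairs.append((window, int(gap in breaks)))
    return pairs


for window, target in training_pairs("agter-ente"):
    print(window, target)
```

Only the break shown in the abstract is marked in this input; in real training data every syllable boundary of a word would be marked. The pairs could then be encoded numerically and passed to a neural network or decision tree learner.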
 
Publisher AOSIS
 
Contributor — —
Date 2010-01-13
 
Type info:eu-repo/semantics/article info:eu-repo/semantics/publishedVersion — — — —
Format application/pdf
Identifier 10.4102/satnt.v29i2.9
 
Source Suid-Afrikaanse Tydskrif vir Natuurwetenskap en Tegnologie; Vol 29, No 2 (2010); 48-65 2222-4173 0254-3486
 
Language eng
 
Relation
The following web links (URLs) may trigger a file download or direct you to an alternative webpage to gain access to a publication file format of the published article:

https://journals.satnt.aosis.co.za/index.php/satnt/article/view/9/9
 
Coverage — — — — — —
Rights Copyright (c) 2010 Tilla Fick, Chris J. Swanepoel https://creativecommons.org/licenses/by/4.0