Split the word karycharana and name the sandhi.
Answers
Answer:
Kannada Spell Checker with Sandhi Splitter
Akshatha A N
Department of ISE
RVCE, Bangalore
Chandana G Upadhyaya
Department of ISE
RVCE, Bangalore
Rajashekara Murthy S
Associate Professor, Department of ISE
RVCE, Bangalore
Abstract—Spelling errors are introduced in text either
during typing, or when the user does not know the correct
phoneme or grapheme. If a language contains complex
words such as sandhi words, in which two or more morphemes
join according to rules, spell checking becomes very tedious.
In such situations, having a spell checker with sandhi
splitter which alerts the user by flagging the errors and
providing suggestions is very useful. A novel algorithm for
sandhi splitting is proposed in this paper. The sandhi
splitter can split about 7000 of the most common sandhi words
in the Kannada language, which were used as test samples. The sandhi splitter
was integrated with a Kannada spell checker and a
mechanism for generating suggestions was added. A
comprehensive, platform independent, standalone spell
checker with sandhi splitter application software was thus
developed and tested extensively for its efficiency and
correctness. A comparative analysis of this spell checker
with sandhi splitter was made, and the results showed that
the Kannada spell checker with sandhi splitter has
improved performance. It is twice as fast, 200 times more
space efficient, 90% accurate for complex
nouns and 50% accurate for complex verbs. Such a spell
checker with sandhi splitter will be of foremost significance
in machine translation systems, voice processing, etc. This
is the first sandhi splitter for Kannada, and an advantage of
the novel algorithm is that it can be extended to all Indian
languages.
Keywords— Natural language processing; Morphology;
Computational linguistics; Sandhi splitter; Spell checker.
I. INTRODUCTION
Kannada is an agglutinative language. It is one of the
Dravidian languages, and by the nature of the Dravidian
languages it has very clear rules defined for every aspect
of its structure. Kannada has roughly 40 million native
speakers and it is one of the 40 most spoken languages in
the world [1]. It is influenced greatly by Sanskrit, and
therefore we can find an overlap of words, structure and
grammar rules including the sandhi and lexicon between
the two languages. Like any other language, Kannada has
grown and will continue to grow and change under the
influence of other languages and accents, and through
speakers who want to make its words easier
to pronounce, spell and write. Its vocabulary therefore has no
fixed boundary. In a language like Kannada,
where there are abundant complex structures and
compound words, a spell checker demands a sandhi
splitter for two reasons. First, since no database of
Kannada words can store every sandhi word without
huge redundancy, a sandhi splitter greatly reduces
the dictionary size. Second, a sandhi splitter is critical for
recognizing spelling errors that arise from an erroneous
morpheme or an erroneous segment at the morpheme
boundary of a sandhi word.
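The first reason can be illustrated with a minimal sketch. It assumes a toy lexicon and a naive splitter that only handles plain concatenation. The names `LEXICON`, `naive_split` and `is_known` are hypothetical, the string `manemaTha` is a made-up test input rather than an attested sandhi word, and this is not the algorithm proposed in the paper. Real sandhi changes letters at the morpheme boundary, which this sketch ignores.

```python
# Toy lexicon of transliterated morphemes; illustrative only.
LEXICON = {"mane", "maTha", "deevaru", "diMDaru"}


def naive_split(word, lexicon):
    """Return (left, right) if `word` is a plain concatenation of two
    lexicon entries, otherwise None. No phonological rules are applied."""
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in lexicon and right in lexicon:
            return left, right
    return None


def is_known(word, lexicon):
    """Accept a word stored directly or one that splits into stored morphemes."""
    return word in lexicon or naive_split(word, lexicon) is not None


# Four stored morphemes cover up to 4 * 4 = 16 two-part combinations,
# none of which has to be stored as its own dictionary entry.
print(is_known("manemaTha", LEXICON))  # True: splits as mane + maTha
print(is_known("manemaTa", LEXICON))   # False: "maTa" is not in the lexicon
```

The same fallback also serves the second reason: when no split yields known morphemes, the word can be flagged as a possible spelling error.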
A morpheme is the smallest meaningful unit in a
language. Joining morphemes to derive complex and
meaningful words without changing the spelling or the
phonetics of the constituent morphemes is called
agglutination. Inflection, on the other hand, is the modification
of words to express grammatical aspects such as
gender, tense, mood and number.
In the processing of any language, morphological
analysis, sentence structure analysis and recognition
are the foundational pillars. In processing Indian
languages, additional factors such as sandhis, samaasas, and inflections
specific to gender and tense also play a role. In Kannada,
there are three ways of forming complex words: samaasa,
jodi pada and sandhi.
A. Samaasa and Jodi Pada
A samaasa is also known as a nominal compound.
Morphologically, a samaasa has each noun or adjective in
its stem form with only the last element obtaining the case
inflection. Examples of samaasa include “peetaMbara”
and “vRukoodara”. A jodi pada is a phonemic binding of
two unrelated morphemes, separated by a hyphen, used in
Kannada. Examples of jodi pada include
“mane-maTha” and “deevaru-diMDaru”.
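Because the two parts of a jodi pada are joined by a hyphen, recovering them needs no phonological rules. A minimal sketch, assuming transliterated input and a hypothetical helper `split_jodi_pada` that is not part of the paper's software:

```python
def split_jodi_pada(word):
    """Split a hyphenated jodi pada into its two parts.

    Returns a (first, second) tuple, or None when the word does not
    contain exactly one hyphen with text on both sides.
    """
    parts = word.split("-")
    if len(parts) != 2 or not all(parts):
        return None
    return parts[0], parts[1]


# The two jodi pada examples given in the text.
for example in ("mane-maTha", "deevaru-diMDaru"):
    print(example, "->", split_jodi_pada(example))
```

A samaasa such as “peetaMbara” contains no hyphen, so the helper returns None for it. Splitting such words requires knowledge of the constituent stems.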
B. Sandhi
Sandhi means ‘to join’. In sandhi formation at the
word boundary, several phonological processes take
place to produce the complex word or the sandhi word.
During this process of joining, one or both of the following
operations occur at the word boundary:
A new letter will appear at the word boundary.