The automatic language translation kit
INTRODUCTION
simpletranslationlanguage.txt
In this text, I will show you how to use a rule-based translation program. The program is called Nilix translator
and, as of 2013, has been under development for 30 years (the complete kit can be downloaded as tar.gz for Linux or as a zip file for Windows). The
main focus of this text is how to create a set of rules that achieves a fully automatic, high-quality translation. As
natural languages are very complex, I want to introduce an easier language which I have constructed for this
purpose.
The Simple Translation Language
The source language consists of only about 300 lexical items and 10 grammatical rules. The words and grammar have
been adopted from Quechua, spoken in Peru, Bolivia, Ecuador and parts of Argentina and Colombia. Quechua is an
agglutinative language, which in its original form is quite complicated. For example, a single verb can include a
subject, object and a determiner as in the phrase "kuyaykim" (I love you!). The root "kuya" means "love", the
transitional suffix "yki" means "I to you" and "m" is an assertive marker which emphasizes the personal
experience.
Of course, such a complex language would not be suitable as a Simple Translation Language, so I
simplified it. A short but complete description of the language, including grammar, pronunciation and basic
vocabulary, can be found in the file "simpletranslationlanguage.txt".
The ten main grammar rules are:
01. The word order is genitive-demonstrative-number-adjective-noun-relative clause, negation-adverb-verb/adjective, conjunction-subject-object-infinitive-verb-marker, subordinate-main clause.
02. Questions append the marker "chu", subordinate sentences "miki", reported speech "si" and exclamations "ya".
03. The negation "mana" precedes the verb, adverb, adjective or noun. Generally, modifiers precede the modified expression.
04. Nouns and pronouns form the plural by "kuna" appended directly to the stem. Noun phrases end in "qa" (subject), "ta" (object), "mi" (equative), "ya" (vocative).
05. Postpositions are "pa" (of), "pi" (in), "man" (to), "manta" (from), "wan" (with), "paq" (for), "kama" (until), "nta" (through). Temporal postpositions add "-m".
06. Articles are "kay" (the) and "huk" (one/a). Numbers "iskay" (2), "kimsa" (3), "tawa" (4), "pichqa" (5), "soqta" (6), "qanchis" (7), "pusaq" (8), "isqon" (9), "chunka" (10), "pachak" (100), "waranqa" (1000), "eje" (0).
07. Noun compositions use a hyphen "-". Names are followed by a marker as "llaqta" (location), "runa" (human) or "qari" (man) or "warmi" (woman), "uywa" (animal), "sacha" (plant).
08. "piy" (who), "ima" (what), "may" (where) end in "taq" (interrogative), in "pas" (indefinite), without ending (relative pronoun). "tukuy" (every) and "mana" (no) may stand in front of them. "kiki" (self) is the reflexive pronoun.
09. Adjectives end in "m", derived adverbs in "tam". Comparisons use "aswan" (more), "lliw" (most) and the postposition "hina" (as), "mantas" (than).
10. Verb endings are "n" (present), "rqan" (past), "nqan" (future) and "ptin" (conditional), "y" (imperative), "na" (infinitive), "sqa" (past participle, ppp) and "sti" (present participle, ppa), "q" (actor).
The pronunciation rules are:
"q" as Scottish "loch", "ll" is "ly", "ñ" is "ny", "ch" as in "church", "r" (rolled) and vowels as in Spanish, i.e.
"a" as in "father", "e" as in "bed", "i" as in "fit", "o" as in "for", "u" as in "put". Stress falls on the last but
one syllable of the root; endings are not normally stressed.
The vocabulary has been reduced to about 300 words. All words not covered by this basic vocabulary or by
combinations of them may be taken from Spanish. In this way, for a language enthusiast or amateur linguist, it will
not be difficult to master the language at all. Words may change part of speech by different terminations.
Some differences between the Simple Translation Language and Quechua are:
1. verbs are not changed according to subject and object
2. the agglutination of suffixes is limited to one or two endings, most are written as independent words
3. possessive relations are not expressed by suffixes, but by pronouns
4. subject pronouns cannot be dropped; in Quechua, the subject is implicit in the verb ending
5. the subject is always marked by "qa" which is optional in Quechua
6. adjectives lose a possible end consonant, e.g. "hatun" (big) becomes "hatu", then "m" is appended
7. "kay" (this) and "huk" (one) are used as articles
8. Quechua has a much larger vocabulary than 300 words
CHAPTER 1
test-01.txt and rules-01.txt
How to invoke the Nilix translator
There are various ways to start a translation with Nilix translator. In every case, you need to open a text terminal and
store the sentences to translate in a plain text file without formatting codes, for example "test-01.txt".
You may invoke the program by the command line
./nilixtranslator test-01.txt rules-01.txt
The program uses the rules contained in a single file, here "rules-01.txt". The rules are prepared in a machine
readable format without comments and stored as several separate files in the directory "definitions".
The first examples
The output of the translation program will be something like:
Nilix translator 1.6 (2013/08/07 09:32:32)
Copyright © 1983-2013 Dr. med. O'Niel Som
www.nili.com * www.nili.de
ñoqa qa kay wasi ta rikun.
I see the house.
kay runa pa kay wasi qa chaypi kan.
The house of the man is here.
kay runakuna qa kay misi ta rikun.
The men see the cat.
kay runa qa kay misikuna ta rikun.
The man sees the cats.
añay ya.
Thank you.
The words used in these example sentences are:
nouns
runa man
wasi house
misi cat
ñoqa I
verbs
ka to be
riku to see
aña to thank
other
kay article (the)
chaypi here
qa subject
ta object
ya exclamation
endings
-kuna plural
-n present tense
-y imperative
Original Quechua phrases would be:
runapa wasin chaypim.
runakunaqa misita rikunku.
runa misikunata rikun.
añaykim.
wasita rikuni.
The translation rules
The first translation rules are in the file "rules-01.txt". The main sections are explained here.
Every section in the rules begins with a key word and terminates with a line containing only hyphens (---).
The key words are:
COMMENTS
SUBSTITUTIONS
LEXICON
TERMINATIONS
IDIOM
SYNTAX
COMPLEX
FUNCTIONS
INFLECTION
IRREGULARITY
FINAL SUBSTITUTIONS
USER LEXICON
These section names may be singular or plural, but have to be UPPER CASE.
When you (or a script) call ./makenilix, these human readable rules are analysed and written to the directory
"definitions" as various .nlx files, together with a main file "exec.nlx" which contains the order in which the rules are
applied. Some of the sections may occur more than once: IDIOM, SYNTAX and COMPLEX.
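The section layout described above (a key word, the rule lines, then a line of hyphens) can be sketched in a few lines of Python. This is only an illustration of the file format, not the code of makenilix: the names `split_sections` and `normalize_keyword` are my own, and the singular/plural handling is simplified to a trailing "S".

```python
SECTION_KEYWORDS = {
    "COMMENTS", "SUBSTITUTIONS", "LEXICON", "TERMINATIONS", "IDIOM",
    "SYNTAX", "COMPLEX", "FUNCTIONS", "INFLECTION", "IRREGULARITY",
    "FINAL SUBSTITUTIONS", "USER LEXICON",
}


def normalize_keyword(line):
    """Return the canonical section key word for a header line, or None."""
    text = line.strip()
    if text in SECTION_KEYWORDS:
        return text
    # Simplification: singular and plural differ only by a final "S".
    toggled = text[:-1] if text.endswith("S") else text + "S"
    return toggled if toggled in SECTION_KEYWORDS else None


def split_sections(text):
    """Split rule text into (key word, body lines) pairs, in file order."""
    sections = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if current is None:
            keyword = normalize_keyword(line)
            if keyword is not None:
                current = (keyword, [])
        elif stripped and set(stripped) == {"-"}:
            # A line containing only hyphens closes the section.
            sections.append(current)
            current = None
        else:
            current[1].append(line)
    return sections
```

Because the result is a list and not a dictionary, sections that occur more than once (IDIOM, SYNTAX, COMPLEX) keep their position in the file.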
The most important and interesting sections are LEXICON and USER LEXICON, SYNTAX, COMPLEX and INFLECTION which I
will explain first.
Section LEXICON
This section holds all grammatical word entries, such as prepositions, auxiliary verbs and pronouns,
which are treated in a special way. Adjectives, nouns, verbs and simple adverbs are placed in the section
USER LEXICON.
Nilix works internally with three grammatical categories for each word entry. Each category is written with three
characters, so the complete grammar string has 9 characters. Every combination of letters is possible, but the
abbreviations should be easy to remember.
syntax or part of speech: SUB, VER, ADJ, ADV
semantics: HUM, MAS, FEM
inflection: VER, VBE, VHA
The choice of these three-letter-abbreviations is free. There are some syntax symbols with a fixed meaning:
E** ending
ADP verb prefix (in languages as German)
ZZZ unknown word
Upper case words may receive one of the following symbols if stored into the lexicon:
NAM name
TIT title
LIN language
GEO geographic name
HUM human
DAY day name
The rule writer (i.e. you) must use the symbols consistently. For example, I use VBE for inflection of the
verb "to be" and VHA for the verb "to have". The whole grammatical definition is written as a string of 9 letters:
VER---VER (which means syntax=VER, semantics=void, inflection=VER)
Abbreviations for common patterns may be defined as
x = ADV------
Each lexicon line contains a word entry in the order source language, grammar string, target language, separated by
blanks:
ñoqa SN1---PN1 I
kay DEB------ the
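As a small illustration of the 9-character grammar string, the following Python sketch splits a lexicon line into its three fields. The names `Grammar`, `parse_grammar` and `parse_lexicon_line` are hypothetical and chosen only for this example; Nilix itself stores the entries in its own .nlx format.

```python
from typing import NamedTuple


class Grammar(NamedTuple):
    syntax: str       # part of speech, e.g. VER
    semantics: str    # e.g. HUM, or --- if unused
    inflection: str   # e.g. VBE


def parse_grammar(code, abbreviations=None):
    """Expand an optional abbreviation and split the 9-character string."""
    full = (abbreviations or {}).get(code, code)
    if len(full) != 9:
        raise ValueError(f"grammar string must have 9 characters: {full!r}")
    return Grammar(full[0:3], full[3:6], full[6:9])


def parse_lexicon_line(line, abbreviations=None):
    """Split 'source grammar target' at the first two blanks."""
    source, code, target = line.split(" ", 2)
    return source, parse_grammar(code, abbreviations), target
```

For example, `parse_lexicon_line("ñoqa SN1---PN1 I")` yields the source word, the fields SN1, --- and PN1, and the target word "I".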
Section USER LEXICON
This is the main part of the lexicon, containing chiefly adjectives, adverbs, nouns and verbs.
Each line contains an entry in the form: source language - grammar string - target language. Frequent grammar
strings may be abbreviated, provided the abbreviation is defined before it is used (as in section LEXICON):
v = VER---VER
s = SUB---SUB
runa s man
misi s cat
riku v see
Irregular forms in the source language may be replaced by regular entries. Let us assume we had English as source
language, then we could create an entry for the irregular plural "children":
child s wawa
children>child s wawa
Idioms consisting of two words may be stored in the user lexicon. The first line contains the words of the source
language, and the second line is the replacement in the target language. If the replacement consists of several words,
they must be joined by the special symbol "§" (which is converted to a blank later). The target language entry begins with a
grammar string, which can be one of the abbreviations defined earlier.
- qarim wawa
s boy
This replaces the string "qarim wawa" by the substantive "boy", the grammar string is expanded from "s" to
"SUB---SUB" according to the abbreviation introduced above.
In the source line, "-" means the words follow one another immediately. As an alternative, "+" is used when there
are possibly other words in between (such as article, negation etc.).
Section TERMINATIONS
For every inflected source language, there will be a lot of suffixes to strip from a word before lexical search. The
longest terminations must come first.
-nqan E11 future
-rqan E12 past
-ptin E13 conditional
-n E10 present
A word in the source text ending in "-nqan" will be looked up in the lexicon without that termination, and if it is
found, then two entries will be made: the first will be the verb, and the second will be the ending.
Example: "rikunqan" is separated into "riku" (to see) and "nqan" (future ending). The resulting string table would
be:
RIKU VER---VER see
NQAN E11------ future
From these entries, a syntax string is constructed, which will be the basis of the translation, in our example this
would be VERE11.
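The termination stripping can be illustrated with a short Python sketch. It assumes a tiny in-memory lexicon and tries the terminations in the listed order (longest first); the function names and the fallback to ZZZ for unknown words are my own simplifications, not the actual Nilix code.

```python
TERMINATIONS = [
    ("nqan", "E11", "future"),
    ("rqan", "E12", "past"),
    ("ptin", "E13", "conditional"),
    ("n", "E10", "present"),
]
LEXICON = {"riku": ("VER---VER", "see")}


def analyse_word(word):
    """Return table entries (form, grammar string, target) for one word."""
    if word in LEXICON:
        grammar, target = LEXICON[word]
        return [(word.upper(), grammar, target)]
    for suffix, symbol, meaning in TERMINATIONS:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and stem in LEXICON:
            grammar, target = LEXICON[stem]
            return [
                (stem.upper(), grammar, target),
                (suffix.upper(), symbol + "------", meaning),
            ]
    # Unknown word: tagged ZZZ in this sketch.
    return [(word.upper(), "ZZZ---ZZZ", word)]


def syntax_string(entries):
    """Concatenate the first three letters of every grammar string."""
    return "".join(grammar[:3] for _, grammar, _ in entries)
```

With these tables, `syntax_string(analyse_word("rikunqan"))` gives "VERE11". If "-n" were tried before "-nqan", the stem "rikunqa" would not be found in the lexicon, which is why the longest terminations come first.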
Section SYNTAX
These are the main sections for the syntax analysis and transformation of a sentence. At the beginning of the syntax
rules, the whole input sentence is stored in a two dimensional array like this:
DEB kay the
SUB runa man
E01 qa nominative
DEB kay the
SUB misi cat
E02 ta accusative
VER riku see
E10 present tense
SEN sentence end
Each rule is written on one line.
DEBSUB -> SUB (001,002)
This rule searches for an occurrence of DEB immediately followed by SUB and replaces them by the symbol SUB, which
contains the elements in the order 001,002 (in this case, the order is not changed). After the application of this
rule, the above sentence would be:
SUB DEB (kay-the) SUB (runa-man)
E01 QA
SUB DEB (kay-the) SUB (misi-cat)
E02 TA
VER riku see
E10 present tense
SEN sentence end
You can use the wildcard "*" to reduce the number of rules by combining similar rules for similar symbols:
DE*SUB -> SUB (001,002)
This rule would match DEM (demonstrative), DEB (definite article), DEU (indefinite article) and so on.
Sometimes, you want to reuse one, two or three characters of the left side on the right side of the rule. Then
you can use the wildcard ".".
SU.E10 -> SU. (001,plu)
This rule would replace SUM E10 by SUM, and SUF E10 by SUF. In parentheses on the right side, you see the use of
an auxiliary word, here "plu" for plural, which is inserted automatically.
To analyse the case markers for subject and object phrases, we can apply rules like these:
SU*E01 -> NOM (001,nom)
SU*E02 -> ACC (001,acc)
To change the order of constituents, you may apply a rule of this type:
NOMACCFIV -> SSS (001,003,002)
This would replace the syntax string "NOM" (nominative) "ACC" (accusative) and "FIV" (finite verb) by the symbol
"SSS" (sentence), and the order would be changed from subject-object-verb to subject-verb-object.
All rules are evaluated repeatedly, until no further changes are made. Repetition is important for recognizing
embedded structures such as relative clauses. It is the responsibility of the rule writer to avoid circular
substitutions, which would not terminate, as in:
AAA -> BBB (001,xxx)
BBB -> AAA (001,yyy)
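The way such rewriting rules work can be sketched in Python. The sketch below is my own reconstruction from the description above, not the Nilix source: the table is a list of (symbol, words) pairs, "*" matches any letter, and letters matched by "." on the left side are copied, in order, into the "." positions on the right side (this copying scheme is an assumption). The pass limit is also my addition, to make circular rules visible instead of looping forever.

```python
def match_symbol(pattern, symbol):
    """Match one 3-letter symbol; return letters captured by '.', or None."""
    captured = []
    for p, s in zip(pattern, symbol):
        if p == ".":
            captured.append(s)
        elif p != "*" and p != s:
            return None
    return captured


def apply_rule(table, left, right, order):
    """Apply 'left -> right (order)' once; return True if the table changed."""
    patterns = [left[i:i + 3] for i in range(0, len(left), 3)]
    for start in range(len(table) - len(patterns) + 1):
        window = table[start:start + len(patterns)]
        captured = []
        for pattern, (symbol, _) in zip(patterns, window):
            letters = match_symbol(pattern, symbol)
            if letters is None:
                break
            captured.extend(letters)
        else:
            fill = iter(captured)
            new_symbol = "".join(next(fill) if c == "." else c for c in right)
            words = []
            for item in order:
                if item.isdigit():
                    words.extend(window[int(item) - 1][1])
                else:
                    words.append(item)  # auxiliary word such as "nom"
            table[start:start + len(patterns)] = [(new_symbol, words)]
            return True
    return False


def run_rules(table, rules, max_passes=100):
    """Repeat all rules until nothing changes."""
    for _ in range(max_passes):
        if not any(apply_rule(table, *rule) for rule in rules):
            return table
    raise RuntimeError("rules did not terminate (circular substitution?)")
```

Applied to the symbols DEB SUB E01 DEB SUB E02 FIV with the rules `("DE*S..", "S..", ["001", "002"])`, `("SU*E01", "NOM", ["001", "nom"])`, `("SU*E02", "ACC", ["001", "acc"])` and `("NOMACCFIV", "SSS", ["001", "003", "002"])`, the table is reduced step by step to a single SSS entry in subject-verb-object order.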
Section COMPLEX
These rules involve entries which do not necessarily follow one another. We need such discontinuous rules, for
example, to generate a verb form according to its subject, in other words: for subject-verb agreement (congruence).
In English, we have to consider only three cases:
subject I -> verb first person singular (only for "I am" and "I was")
subject singular (he, she, it or singular noun phrase) -> verb third person singular
every other subject -> verb plural
As an example, I will show here the rule for putting the attribute "fir" for first person to the verb phrase.
2 +
NOM #SN1
FIN /fir
unchanged_
symbol_FIV insert_fir insert_sin
The start line means: the rule consists of two entries, which follow one another (not necessarily immediately).
The first item to find is an entry with the syntax string "NOM" and the attribute #SN1. The symbol "#" is used for
the part of speech or syntax, which is the first three letters in the grammar string of every entry. The same symbol
is used in the IDIOM section. So, the word "ñoqa" which means "I" is in the lexicon as "SN1---PN1I". The
second item to find is an entry tagged FIN (finite verb), which does not yet have the attribute "fir". The slash means
"not".
The last two lines of the rule are the actions. "unchanged_" means: do not change the line where NOM was
found. The last line contains the three actions for the FIN found:
"symbol_" changes the symbol, here from FIN to FIV
"insert_" appends a short attribute, here "fir" (first person) and "sin" (singular).
You can see the other similar rules to determine the concord of subject and verb for plural subjects and "you", for
imperatives without subject and for the remaining subjects which must be third person singular in the file
rules-01.txt in the COMPLEX section.
Remember that you may choose whichever abbreviation you like. So instead of "fir" you might have chosen "pe1" (person
number one) or something similar. But then you would have to use this symbol consistently in all the other rules, too.
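The logic of this discontinuous rule can be sketched in Python as follows. The dictionary layout of the table entries and the function name are assumptions made for the example; only the matching logic (a NOM with part of speech SN1, followed somewhere later by a FIN without "fir") is taken from the rule above.

```python
def apply_first_person_rule(table):
    """Mark the first FIN that follows a first-person subject as FIV fir sin."""
    for i, subject in enumerate(table):
        # First item: symbol NOM whose part of speech (#) is SN1.
        if subject["symbol"] != "NOM" or subject["grammar"][:3] != "SN1":
            continue
        # The NOM line itself stays as it is (unchanged_).
        for verb in table[i + 1:]:
            # Second item: a FIN that does not yet (/) carry "fir".
            if verb["symbol"] == "FIN" and "fir" not in verb["attrs"]:
                verb["symbol"] = "FIV"           # symbol_FIV
                verb["attrs"] += ["fir", "sin"]  # insert_fir insert_sin
                return True
    return False
```

Because the rule changes FIN to FIV, it cannot match the same verb a second time, so repeated evaluation terminates.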
Section INFLECTION
In this section, the actual forms of words are generated according to grammatical attributes. As an example, the
following rule generates the word form "are" for the inflectional category VBE (verb to be) and the attribute
"plural". The "=" means: substitute the whole entry.
VBE (plu) -> =are
More often, endings are generated using "-", as for the third person singular or the noun plural "s":
VE* (thi,sin) -> -s
SU* (plu) -> -s
Section IRREGULARITY
After generation of inflected forms, there are sometimes irregular words, which can be treated by simple substitution
rules. For irregular verb forms of the verb "to have":
haves -> has
haved -> had
Or, irregular plural forms of substantives:
mans -> men
womans -> women
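How the INFLECTION and IRREGULARITY sections work together can be shown with a small Python sketch. The tables repeat the rules given above; the function names and the representation of attributes as a list are my own assumptions.

```python
INFLECTION_RULES = [
    ("VBE", {"plu"}, "=are"),
    ("VE*", {"thi", "sin"}, "-s"),
    ("SU*", {"plu"}, "-s"),
]
IRREGULAR = {"haves": "has", "haved": "had", "mans": "men", "womans": "women"}


def class_matches(pattern, inflection):
    """Compare an inflection class with a pattern; '*' matches any letter."""
    return all(p in ("*", c) for p, c in zip(pattern, inflection))


def inflect(word, inflection, attrs):
    """Generate the word form, then repair irregular results."""
    for pattern, required, action in INFLECTION_RULES:
        if class_matches(pattern, inflection) and required <= set(attrs):
            if action[0] == "=":
                word = action[1:]          # substitute the whole entry
            else:
                word = word + action[1:]   # append an ending
            break
    return IRREGULAR.get(word, word)
```

The regular rule first produces "mans" from "man", and only then does the substitution table turn it into "men". This keeps the inflection rules simple and moves the exceptions into one place.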
The remaining sections
Section COMMENTS
In this section, you will find comments for the human reader. They will be ignored by the program.
Section SUBSTITUTIONS
All characters are converted internally to upper case. The letters a-z are automatically transformed into A-Z.
Other substitutions might handle vowels with accents or other diacritics:
á -> A
é -> E
ñ -> NY
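A minimal Python sketch of this normalization step, assuming the three substitutions listed above (the function name is hypothetical):

```python
SUBSTITUTIONS = {"á": "A", "é": "E", "ñ": "NY"}


def normalize(word):
    """Apply the substitution table; all other letters become upper case."""
    return "".join(SUBSTITUTIONS.get(ch, ch.upper()) for ch in word)
```

This is why "ñoqa" appears as NYOQA in the verbose output of the program.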
Section IDIOM
Idioms here are lexical entries which consist of several words. A rule may test not only words but also the part of
speech or any other grammatical information. These rules are applied before syntax transformations begin.
The rule consists of two lines. In the first line, we tell the program which items to find. The second line contains
the actions to be taken. For example:
#STR
E
This rule means: search an entry tagged by the syntax (#) string STR (abbreviation for sentence-start), and if found
erase that item (E). Another possible action is R (replace):
#VDO
R*VER---VDOdo
This second rule would search for the syntax symbol VDO (which might be used for verb-to do), and if found, replace
it by an entry with grammar string VER---VDO and lexical entry in the target language "do".
Section FUNCTIONS
Some function words are substituted at the end of syntax transformations by real words, for example:
POF PRE of
The symbol POF is generated and inserted by the program when it encounters a syntax rule of this form:
GENSUB -> SUB (001,POF,002)
POF appears in upper case on the right side of the rule to indicate a complete new word in the target language,
different to all entries in lower case (which are short grammatical attributes).
Section FINAL SUBSTITUTIONS
This section contains, as the name suggests, some substitutions which are necessary after the application of
all rules. For example, the "§" is substituted by a blank, and a blank before a punctuation mark is removed.
§ ->
. -> .
Normally, there are no changes required for this section.
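The effect of these two final substitutions can be sketched in Python. The regular expression and the set of punctuation marks are my own choice for the illustration; the real section works with plain string substitutions.

```python
import re


def final_substitutions(text):
    """Turn '§' into a blank and remove blanks before punctuation marks."""
    text = text.replace("§", " ")
    return re.sub(r"\s+([.,;:!?])", r"\1", text)
```

For example, the joined output "Cats will not be§able§to see §." becomes "Cats will not be able to see."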
How a translation is performed
When we translate a sentence, all rules which match the structure of the phrase are applied sequentially and
repeatedly until the syntax string has been reduced to SSS. Let us consider the simple phrase:
ñoqa qa kay wasi ta rikun.
The final translation is "I see the house". The following output has been generated by the verbose (or visible)
option of the Nilix translator.
########################################################
ñoqa qa kay wasi ta rikun.
version = Nilix translator 1.6 (2013/08/07 09:32:31)
pathname = definitions/
visible = TRUE
filename = pass1
resultname = pass2
shellmode = TRUE
MODUS=C
-----------------------------------------------------------------------------
*: visible = false
-----------------------------------------------------------------------------
idio1.nlx:
words in idioms at beginning of complexes (idiome.pas)
1 ZZ_START STR------(START)
2 NYOQA SN1---PN1I
3 QA C01------nominative
4 KAY DEB------the
5 WASI SUB---SUBhouse
6 TA C02------accusative
7 RIKU VER---VERsee
8 present E10------present
9 . ZZZ---ZZZ.
Idiom rule #ZZZ
-> R#NAM
1 STR------(START)
2 SN1---PN1I
3 C01------nominative
4 DEB------the
5 SUB---SUBhouse
6 C02------accusative
7 VER---VERsee
8 E10------present
9 NAM---ZZZ.
Idiom rule .
-> R*SEN------§.
1 STR------(START)
2 SN1---PN1I
3 C01------nominative
4 DEB------the
5 SUB---SUBhouse
6 C02------accusative
7 VER---VERsee
8 E10------present
9 SEN------§.
-----------------------------------------------------------------------------
*: show original words
ZZ_START STR------(START)
NYOQA SN1---PN1I
QA C01------nominative
KAY DEB------the
WASI SUB---SUBhouse
TA C02------accusative
RIKU VER---VERsee
present E10------present
§. SEN------§.
-----------------------------------------------------------------------------
*: syntax start
-----------------------------------------------------------------------------
synt1.nlx:
syntax rule SN* -> SUB
order: 001
found at 2
STRSUBC01DEBSUBC02VERE10SEN
1 STR: ZZ_START
2 SUB: NYOQA
3 C01: QA
4 DEB: KAY
5 SUB: WASI
6 C02: TA
7 VER: RIKU
8 E10: present
9 SEN: §.
syntax rule VERE10 -> FIN
order: 001pre
found at 7
STRSUBC01DEBSUBC02FINSEN
1 STR: ZZ_START
2 SUB: NYOQA
3 C01: QA
4 DEB: KAY
5 SUB: WASI
6 C02: TA
7 FIN: RIKU pre
8 SEN: §.
-----------------------------------------------------------------------------
synt2.nlx:
syntax rule DE*S.. -> S..
order: 001002
found at 4
STRSUBC01SUBC02FINSEN
1 STR: ZZ_START
2 SUB: NYOQA
3 C01: QA
4 SUB: KAY WASI
5 C02: TA
6 FIN: RIKU pre
7 SEN: §.
syntax rule S**C01 -> NOM
order: 001nom
found at 2
STRNOMSUBC02FINSEN
1 STR: ZZ_START
2 NOM: NYOQA nom
3 SUB: KAY WASI
4 C02: TA
5 FIN: RIKU pre
6 SEN: §.
syntax rule S**C02 -> ACC
order: 001acc
found at 3
STRNOMACCFINSEN
1 STR: ZZ_START
2 NOM: NYOQA nom
3 ACC: KAY WASI acc
4 FIN: RIKU pre
5 SEN: §.
-----------------------------------------------------------------------------
comp1.nlx:
STRNOMACCFINSEN
1 STR: ZZ_START
2 NOM: NYOQA nom
3 ACC: KAY WASI acc
4 FIN: RIKU pre
5 SEN: §.
idiom rule: NOM #SN1 ù FIN /fir
substitution: U ù SFIV Ifir Isin
After processing:
1. operation U in line 2
Line indexes: 2 4
STRNOMACCFINSEN
1 STR: ZZ_START
2 NOM: NYOQA nom
3 ACC: KAY WASI acc
4 FIN: RIKU pre
5 SEN: §.
After processing:
2. operation S FIV in line 4
Line indexes: 2 4
STRNOMACCFIVSEN
1 STR: ZZ_START
2 NOM: NYOQA nom
3 ACC: KAY WASI acc
4 FIV: RIKU pre
5 SEN: §.
After processing:
2. operation I fir in line 4
Line indexes: 2 4
STRNOMACCFIVSEN
1 STR: ZZ_START
2 NOM: NYOQA nom
3 ACC: KAY WASI acc
4 FIV: RIKU pre fir
5 SEN: §.
After processing:
2. operation I sin in line 4
Line indexes: 2 4
STRNOMACCFIVSEN
1 STR: ZZ_START
2 NOM: NYOQA nom
3 ACC: KAY WASI acc
4 FIV: RIKU pre fir sin
5 SEN: §.
-----------------------------------------------------------------------------
synt3.nlx:
syntax rule NOMACCFIV -> SSS
order: 001003002
found at 2
STRSSSSEN
1 STR: ZZ_START
2 SSS: NYOQA nom RIKU pre fir sin KAY WASI acc
3 SEN: §.
-----------------------------------------------------------------------------
synt4.nlx:
syntax rule ...SSS -> SSS
order: 001002
found at 1
SSSSEN
1 SSS: ZZ_START NYOQA nom RIKU pre fir sin KAY WASI acc
2 SEN: §.
syntax rule SSS... -> SSS
order: 001002
found at 1
SSS
1 SSS: ZZ_START NYOQA nom RIKU pre fir sin KAY WASI acc §.
-----------------------------------------------------------------------------
*: select first entry of polyvalent syntax; syntax end
-----------------------------------------------------------------------------
func.nlx:
Syntax end, table before interpreting rules (syntax.pas)
SSS
1 SSS: ZZ_START NYOQA nom RIKU pre fir sin KAY WASI acc §.
Store normal f=nom
Store normal f=pre
Store normal f=fir
Store normal f=sin
Store normal f=acc
-----------------------------------------------------------------------------
idio2.nlx:
words in idioms at beginning of complexes (idiome.pas)
1 ZZ_START STR------(START)
2 NYOQA SN1---PN1I
3 nom aux---auxnom
4 RIKU VER---VERsee
5 pre fir sin aux---auxpre fir sin
6 KAY DEB------the
7 WASI SUB---SUBhouse
8 acc aux---auxacc
9 §. SEN------§.
Idiom rule #STR
-> E
1 SN1---PN1I
2 aux---auxnom
3 VER---VERsee
4 aux---auxpre fir sin
5 DEB------the
6 SUB---SUBhouse
7 aux---auxacc
8 SEN------§.
-----------------------------------------------------------------------------
flex.nlx:
Test words[1] SN1---PN1I
Test words[2] aux---auxnom
Test words[3] VER---VERsee
Test words[4] aux---auxpre fir sin
Test words[5] DEB------the
Test words[6] SUB---SUBhouse
Test words[7] aux---auxacc
Test words[8] SEN------§.
-----------------------------------------------------------------------------
irrg.nlx:
words in idioms at beginning of complexes (idiome.pas)
1 NYOQA SN1---PN1I
2 nom aux---auxnom
3 RIKU VER---VERsee
4 pre fir sin aux---auxpre fir sin
5 KAY DEB------the
6 WASI SUB---SUBhouse
7 acc aux---auxacc
8 §. SEN------§.
-----------------------------------------------------------------------------
*: show with part of speech
NYOQA SN1---PN1I
nom aux---auxnom
RIKU VER---VERsee
pre fir sin aux---auxpre fir sin
KAY DEB------the
WASI SUB---SUBhouse
acc aux---auxacc
§. SEN------§.
I see the house.
########################################################
CHAPTER 2
test-02.txt and rules-02.txt
Multiple parts of speech
One characteristic of the Simple Translation Language is that many words may be used as various parts of speech. In
English, there is a similar feature, e.g.
love = noun, verb
round = adjective, verb, substantive
back = adverb, substantive, verb
Let us define an abbreviation for the commonest type noun/verb in the USER LEXICON:
s = SUB---SUB
v = VER---VER
n = V/N
wayllu n *s love *v love
Then there should be a disambiguation procedure. The easiest way to achieve this is to use a standard SYNTAX
section.
V/NC.. -> SUBC.. (001+002) > 001=SUB
V/NE1. -> VERE1. (001+002) > 001=VER
The first rule says: whenever a V/N is followed by a case marker (C01="qa" etc.) then change its symbol to noun. In
the second rule, the program will change every V/N to VER whenever a verb termination follows (E10="n"=present
tense).
If no disambiguation rule matches, then the first entry is chosen automatically at the end of the syntax
transformation process. So be careful about the order of the alternatives when you write the entry in the USER LEXICON:
wayllu n *s love *v love
would be different from the following entry (where v=VER---VER is the default):
wayllu n *v love *s love
If you look at the rules in the TERMINATIONS section, you will see that there is another way to disambiguate words
directly:
-NQAN E11 future V..
-RQAN E12 past V..
-PTIN E13 conditional V..
-N E10 present V..
The V.. at the end of each line means: if this ending is found then treat the word as verb.
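The disambiguation of a noun/verb entry can be sketched in Python. The parsing of the "*s ... *v ..." notation and the two tests on the following symbol mirror the two SYNTAX rules shown earlier in this chapter; the function names are hypothetical and the sketch ignores everything else the program does.

```python
ABBREVIATIONS = {"s": "SUB---SUB", "v": "VER---VER"}


def parse_alternatives(line):
    """Parse 'wayllu n *s love *v love' into (word, [(grammar, target)])."""
    word, _, rest = line.split(" ", 2)
    alternatives = []
    for chunk in rest.split("*")[1:]:
        code, target = chunk.strip().split(" ", 1)
        alternatives.append((ABBREVIATIONS[code], target))
    return word, alternatives


def choose(alternatives, next_symbol):
    """Pick a reading from the following symbol, else take the first one."""
    if next_symbol.startswith("C"):       # case marker, as in V/NC..
        wanted = "SUB"
    elif next_symbol.startswith("E1"):    # verb termination, as in V/NE1.
        wanted = "VER"
    else:
        return alternatives[0]
    for grammar, target in alternatives:
        if grammar.startswith(wanted):
            return grammar, target
    return alternatives[0]
```

With the entry "wayllu n *s love *v love", a following C01 selects the noun and a following E12 selects the verb. When neither applies, the noun is returned simply because it is listed first.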
Possessive pronouns
In the grammar rules for the Simple Translation Language, we have seen that the genitive case is formed by "pa".
"wasi pa punku" means "door of the house" or "house door". The same genitive rule is applied to personal pronouns:
"ñoqa pa wasi" means "my house". As the literal translation would be "house of me", we must invent a rule for
English possessive pronouns.
The simplest way is to include two word lexical entries in the USER LEXICON. We could define an abbreviation "o" as
part of speech for possessive pronouns.
o = PPN------
+ ñoqa pa
o my
+ qam pa
o your
In the syntax rules, we have to include PPN in the recognition of noun phrases. In rules-01.txt, there were the
following rules:
; build noun phrases
ADJS.. -> S.. (001,002)
DE*S.. -> S.. (001,002)
NUMS.. -> S.. (001,002,plu)
Here, we would insert the line in this SYNTAX section:
PPNS.. -> S.. (001,002)
But, as possessive pronouns in English are treated in the same way as demonstrative pronouns, we could use the part
of speech DEM, so no new syntax rule would be required. Let us go back to the USER LEXICON and insert the following
lines instead:
d = DEM------
+ ñoqa pa
d my
+ qam pa
d your
With this definition, we can leave the SYNTAX section as it was before.
Various inflectional classes
When we now issue the command to translate our text file for this chapter, we get:
Nilix translator 1.6 (2013/08/07 10:45:46)
Copyright © 1983-2013 Dr. med. O'Niel Som
www.nili.com * www.nili.de
ñoqa qa qam ta wayllurqan.
I loveed you.
ñoqa pa wayllu qa.
My love.
Of course, you will see an error in the past tense of love: the program has simply added "-ed", so the wrong form
"loveed" is the result.
In English, verbs ending in "-e" are conjugated differently in the past tense and in the gerund form (dropping the
final "-e" before "-ing"):
love love-s love-d lov-ing
kiss kiss-es kiss-ed kiss-ing
seem seem-s seem-ed seem-ing
In the inflectional part of the grammar string, we can define a different class for verbs ending in "-e" and also a
class for verbs requiring "-es" in the present.
v = VER---VER
ve = VER---VEE
vs = VER---VES
Next, the entry for "wayllu" would be changed to contain "ve" instead of "v". Then, we would construct different
rules for our classes in the INFLECTION section:
VES (thi,sin) -> -es
VE* (thi,sin) -> -s
VEE (pas) -> -d
VE* (pas) -> -ed
To avoid unnecessary rules, the wildcard "*" is used after the specific rule to catch all remaining verbs.
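Why the order of these rules matters can be seen in a short Python sketch. It assumes that the first matching rule wins, which is how I read the rule list above; the function name and data layout are hypothetical.

```python
RULES = [
    ("VES", {"thi", "sin"}, "es"),
    ("VE*", {"thi", "sin"}, "s"),
    ("VEE", {"pas"}, "d"),
    ("VE*", {"pas"}, "ed"),
]


def add_ending(word, inflection, attrs, rules=RULES):
    """Append the ending of the first rule matching class and attributes."""
    for pattern, required, ending in rules:
        same_class = all(p in ("*", c) for p, c in zip(pattern, inflection))
        if same_class and required <= set(attrs):
            return word + ending
    return word
```

With this order, "love" (class VEE) becomes "loved" and "kiss" (class VES) becomes "kisses". If the wildcard rules came first, they would catch every verb and "loveed" would be generated again.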
CHAPTER 3
rules-03.txt and test-03.txt
Translating infinitives
In the Simple Translation Language, modal/auxiliary verbs are combined directly with a preceding infinitive.
In English, there are infinitives with "to" and infinitives without "to", for example:
ñoqa qa purina munan.
I want to go.
ñoqa qa purina atin.
I can go.
In the LEXICON, some modal verbs can be inserted, disregarding for the moment their possible usage as nouns.
muna VER---VER want
ati VER---VER can
tiya VER---VER must
It would be nice to mark the verbs which require a "to" infinitive. We can do this in the semantic field, which has not
been used so far:
muna VERTOIVER want
where TOI means: "infinitive with to".
An infinitive is recognized by its termination E15 (-na) in the corresponding SYNTAX section.
V**E15 -> INF (001,inf)
Let us now define a COMPLEX section which evaluates the combination INF FIN, an infinitive and a finite verb following
one after another. In the items to recognize, we tell the program to compare only the semantic field, which is
marked by ^.
# test part of speech
^ test semantic
~ test inflection
$ test the whole grammar string
/ negation
The rule can be written as this using the symbol PTO as the "preposition to" that may not be present yet:
2 >
INF /PTO
FIN ^TOI
insert_0PTO
unchanged_
2 > means: there are two items, which follow immediately (no other words may come in between).
insert_0PTO means: insert at position 0 (at the beginning) the item PTO.
unchanged_ means: leave this line unchanged.
Somewhere, the symbol PTO must be expanded to a full preposition "to". The adequate place for this is section
FUNCTIONS:
PTO PRE to
Finally, we must swap the positions of infinitive and finite verb for English by inserting a SYNTAX rule:
INF FI. -> FI. (002,001)
We can now translate the first two sentences of test-03.txt correctly.
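The combined effect of the COMPLEX rule, the FUNCTIONS entry and the reordering rule can be summarized in a Python sketch. It works directly on target words and is only meant to show the decision taken on the semantic field; the function name and the tuple layout are assumptions.

```python
def combine_infinitive(infinitive, finite):
    """Combine INF and FIN, each given as a (target word, grammar string) pair."""
    inf_word, _ = infinitive
    fin_word, fin_grammar = finite
    semantics = fin_grammar[3:6]       # the field tested by ^
    if semantics == "TOI":
        inf_words = ["to", inf_word]   # insert_0PTO, expanded to "to"
    else:
        inf_words = [inf_word]
    return [fin_word] + inf_words      # order (002,001): finite verb first
```

For "purina munan" the finite verb "want" carries TOI, so the result is "want to go"; for "purina atin" the result is "can go".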
CHAPTER 4
rules-04.txt and test-04.txt
Generating English verb tenses
So far, we considered only present and past tense because their translation is straightforward. Looking at the
future tense, we will see that we must generate an auxiliary verb in English. Here are all tenses of the verb
"riku":
rikun see
rikurqan saw
rikunqan will see
rikuptin would see
rikusti seeing
rikusqa seen
rikuy see!
rikuna to see
To translate the future and conditional tense, all we have to do is to insert an auxiliary in the right place, that
is at the end of the verb phrase because the normal order of constituents is infinitive-finite verb.
VERE11 -> INFFIV (001,inf+WIL)
VERE13 -> INFFIV (001,inf+WLD)
Now, you can translate the example phrases in test-04.txt. The result will be:
Nilix translator 1.6 (2013/08/07 10:45:46)
Copyright © 1983-2013 Dr. med. O'Niel Som
www.nili.com * www.nili.de
ñoqa qa huk misi ta rikunqan.
I will see a cat.
qam qa misikuna ta rikuptin.
You would see cats.
CHAPTER 5
rules-05.txt and test-05.txt
Negative phrases
In the Simple Translation Language, there is only one negation "mana", which is placed before the verb phrase. In
English, we must consider two cases:
(1) no auxiliary verb is present, then add "to do"
(2) an auxiliary verb is there, then use only "not"
Auxiliary verbs, which can be negated, should be easily distinguishable. So we choose a symbol for them:
can
could
must
will
would
do
does
did
am
is
are
was
were
may
might
shall
should
A similar procedure will be used to determine if an inverted word order is necessary in a question. Some of the forms
mentioned in the list above will be generated by the program, such as WIL (will) and WLD (would). We should
substitute them by complete entries with full grammar string. To do this only at syntax end in section FUNCTIONS
would be too late.
One of several possibilities is the use of a semantic marker "AUX", placed in the middle of each grammar string
directly in the LEXICON section:
ati VERAUXVCA can
tiya VERAUX--- must
ka VERAUXVBE be
ruwa VERAUXVDO do
We will generate the auxiliaries directly during the verb ending analysis, with the temporary syntax symbol XXX
to be substituted later:
; future and conditional with auxiliary
VERE11 -> INFXXX (001,inf+WIL)
VERE13 -> INFXXX (001,inf+WLD)
WIL and WLD are expanded to full entries in the next COMPLEX section:
; substitute WIL and WLD
1 +
XXX WIL
symbol_FIN insert_0VERAUX---will erase_1
1 +
XXX WLD
symbol_FIN insert_0VERAUX---would erase_1
Once all verb phrases have been analysed, but before subject-verb agreement is handled, only the symbols INF and FIN
are in use. This is the moment to treat the negation in these cases:
a) an auxiliary is there (delete NEG, and insert it directly after the auxiliary verb, the position indicator is 1+)
2 >
NEG
FI* ^AUX
delete_
insert_1+NEG------not
b) no auxiliary (change NEG to FIN, insert "do" at the beginning, and mark the full verb as infinitive)
2 >
NEG
FI* /^AUX pre
symbol_FIN insert_0VERAUXVDOdo insert_pre
symbol_INF erase_2 insert_inf
2 >
NEG
FI* /^AUX pas
symbol_FIN insert_0VERAUXVDOdo insert_pas
symbol_INF erase_2 insert_inf
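The two negation cases can also be pictured outside the rule notation. The following Python sketch is not Nilix code; the function name `negate` and its parameters are assumptions made for illustration, and only the auxiliary list is taken from this chapter. It expects the full verb in its base form, as the rules above do after marking it as infinitive.

```python
# Auxiliary verbs as listed in this chapter.
AUXILIARIES = {
    "can", "could", "must", "will", "would", "do", "does", "did",
    "am", "is", "are", "was", "were", "may", "might", "shall", "should",
}


def negate(words, verb_index, tense="pre", third_singular=False):
    """Return a negated copy of `words`; the finite verb sits at `verb_index`.

    Hypothetical helper: `tense` is "pre" or "pas", as in the rule attributes.
    """
    verb = words[verb_index]
    if verb in AUXILIARIES:
        # Case a): an auxiliary is present, so "not" goes directly after it.
        return words[:verb_index + 1] + ["not"] + words[verb_index + 1:]
    # Case b): no auxiliary, so a form of "do" carries the tense and the
    # full verb stays in its base (infinitive) form.
    if tense == "pas":
        do_form = "did"
    elif third_singular:
        do_form = "does"
    else:
        do_form = "do"
    return words[:verb_index] + [do_form, "not", verb] + words[verb_index + 1:]


print(" ".join(negate(["I", "see"], 1)))
print(" ".join(negate(["you", "will", "see"], 1)))
```

The sketch decides the form of "do" directly, whereas the rules above insert "do" first and leave the irregular forms to the INFLECTION section.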
We include the forms of "do" and "can" in the INFLECTION section to generate their irregular forms:
VDO (pas) -> =did
VDO (thi,sin) -> =does
VDO (ppp) -> =done
VCA (pas) -> =could
Finally, you will see that the negative of "can" is translated as "can not", which should be "cannot", so another
rule is included as IDIOM before section FINAL SUBSTITUTIONS:
can 0not
E R*VER------cannot
0not means: zero other words may come in between
E = erase
R = replace
Not implemented yet: the future or conditional of "ati" (can), which we have to translate as "will/would not
be able to...". We can handle it by extending the last IDIOM rule with
will 1can
U R*VER------be§able§to
would 1can
U R*VER------be§able§to
1can means: one word may come in between (e.g. not).
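To make the gap notation (0not, 1can) concrete, here is a small Python sketch. It is only an illustration of the idea, not the way Nilix processes IDIOM rules; the function `apply_idiom` and its parameters are hypothetical.

```python
def apply_idiom(words, first, second, max_gap, replacement, erase_first):
    """Replace `second` when it follows `first` with at most `max_gap`
    other words in between (hypothetical helper).

    `replacement` is a list of words (action R); `erase_first` removes the
    first word (action E), otherwise it is left unchanged (action U).
    """
    out = list(words)
    i = 0
    while i < len(out):
        if out[i] == first:
            last = min(i + 2 + max_gap, len(out))
            for j in range(i + 1, last):
                if out[j] == second:
                    out[j:j + 1] = replacement
                    if erase_first:
                        del out[i]
                    break
        i += 1
    return out


# "can 0not": no word may come in between, "can" is erased.
print(apply_idiom("a cat can not see".split(), "can", "not", 0, ["cannot"], True))
# "will 1can": one word (e.g. "not") may come in between, "will" is kept.
print(apply_idiom("cats will not can see".split(), "will", "can", 1,
                  ["be", "able", "to"], False))
```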
The final translation of test-05.txt is then:
Nilix translator 1.6 (2013/08/07 10:45:46)
Copyright © 1983-2013 Dr. med. O'Niel Som
www.nili.com * www.nili.de
qam qa mana rikunqan.
You will not see.
ñoqa qa mana rikun.
I do not see.
kay runa qa mana rikurqan.
The man did not see.
huk misi qa mana rikuna atin.
A cat cannot see.
misikuna qa mana rikuna atirqan.
Cats could not see.
misikuna qa mana rikuna atinqan.
Cats will not be able to see.
misikuna qa mana rikuna atiptin.
Cats would not be able to see.
CHAPTER 6
test-06.txt and rules-06.txt
Questions
For the translation of questions, there are two cases to consider (similar to negative phrases):
(1) with auxiliary
(2) without auxiliary
both of them in positive and negative sentences.
All questions terminate in "chu" in the Simple Translation Language. The first thing is to include this particle
into the section LEXICON. We will use the symbol QQQ for it.
chu QQQ------ question
We must treat the pronoun "you" as plural and supply a rule for it before all pronouns and names are treated as nouns
in the SYNTAX section. I noticed this error when I tried to translate a "Do you..." question.
SN2 -> SUP (001,plu)
NAM -> SUB (001)
SN* -> SUB (001)
The introduction of "do" and copying of the tense marker from the main verb to the auxiliary, is the same process as
for negative phrases. We define a temporary symbol "DDD" for the "do" which has not yet received the appropriate tense
and rewrite the rule for negative sentences to match this.
; introduce "do" (DDD) negatives without auxiliary in present and past.
; T1=exchange this line and line 1 because the normal order is infinitive - finite verb
2 >
NEG
FIN /^AUX
symbol_DDD insert_0DDDAUXVDOdo
symbol_INF insert_inf exchange_1
; introduce "do" (DDD) in questions when no auxiliary is present
3 >
*** /&INF
FIN
QQQ
unchanged_
symbol_INF insert_inf after_*DDDAUXVDOdo
unchanged_
; copy the tense marker (pre=present, pas=past) from the infinitive to the inserted "do", DDD->FIN
2 >
INF pre
DDD
erase_1
symbol_FIN insert_pre
2 >
INF pas
DDD
erase_1
symbol_FIN insert_pas
At this point, all questions should possess an auxiliary verb. Its position is changed to the beginning of the phrase
in the SYNTAX section, at the same time QQQ (="chu") is dropped.
NOM *** INF FIN QQQ -> SSS (004,001,003,002)
NOM INF FIN QQQ -> SSS (003,001,002)
... QQQ -> ... (001)
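The effect of these steps on the English side can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the Nilix mechanism: the function `make_question` is hypothetical, the subject and predicate are passed as already translated word lists, and a full verb is expected in its base form.

```python
# Auxiliary verbs as listed in chapter 5.
AUXILIARIES = {
    "can", "could", "must", "will", "would", "do", "does", "did",
    "am", "is", "are", "was", "were", "may", "might", "shall", "should",
}


def make_question(subject, predicate, tense="pre", third_singular=False):
    """Build a yes/no question (hypothetical helper).

    `predicate` starts with the finite verb, followed by the rest of the
    verb phrase.
    """
    first, rest = predicate[0], predicate[1:]
    if first in AUXILIARIES:
        # An auxiliary is present: move it in front of the subject.
        words = [first] + subject + rest
    else:
        # No auxiliary: introduce "do" with the tense of the main verb.
        if tense == "pas":
            do_form = "did"
        elif third_singular:
            do_form = "does"
        else:
            do_form = "do"
        words = [do_form] + subject + [first] + rest
    sentence = " ".join(words)
    return sentence[0].upper() + sentence[1:] + "?"


print(make_question(["you"], ["see", "the", "cat"]))
print(make_question(["you"], ["can", "see", "the", "cat"]))
```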
If you like, you may expand the rules for contracted negative forms. R> means: replace only the target language
item.
can 0not
E R>cannot
could 0not
E R>couldn't
will 1can
U R>be§able§to
would 1can
U R>be§able§to
do 0not
E R>don't
did 0not
E R>didn't
will 0not
E R>won't
would 0not
E R>wouldn't
Now, we can successfully translate all phrases from test-06.txt.
Nilix translator 1.6 (2013/08/07 10:45:46)
Copyright © 1983-2013 Dr. med. O'Niel Som
www.nili.com * www.nili.de
qam qa kay misi ta rikun chu?
Do you see the cat?
qam qa kay misi ta rikuna atin chu?
Can you see the cat?
qam qa mana rikunqan chu?
Won't you see?
ñoqa qa mana rikun chu?
Don't I see?
CHAPTER 7
test-07.txt and rules-07.txt
Adjective inflection
At the beginning, we want to disambiguate adjectives wherever various parts of speech are possible.
V/NE3- -> ADJE3. (001+002) > 001=ADJ
Up to now, we have used adjectives only in the basic form. In English, there are 3 derivations:
(1) adverb, -ly
(2) comparative, -er or more
(3) superlative, -est or most
In the Simple Translation Language, adverbs derived from adjectives have the ending "tam" instead of "m", the comparative is
formed by a preceding "aswan" and the superlative by "lliw". The comparison object is marked by "hina" (as) or "mantas"
(than), or by a genitive for the superlative as in English.
In the TERMINATIONS section, we add a line for adverbs:
-tam E31 adverb
-m E30 adjective
Some new entries go to the LEXICON section:
aswan MOR------ more
lliw MOS------ most
Let us analyse the adjective inflection first in a SYNTAX section.
ADJE31 -> ADV (001,adv)
MORAD. -> AD. (002,cmp)
MOSAD. -> AD. (002,sup)
"aswan" and "lliw" might be used as adverbs receiving the ending "tam". In all other cases, when they are not used
in front of an adjective, treat them as adjectives.
MO.E31 -> ADV (001,adv)
MO* -> ADJ (001)
To generate the inflectional forms, we have to consider two cases:
(1) comparative and superlative are formed by endings (-er, -est)
(2) by using "more" and "most"
As case (2) is much more frequent in English than case (1), we will mark only case (1) with a special inflection
class. Within this class, we must distinguish adjectives ending in -e, adjectives which double their last consonant (as
hot, hotter, hottest), and adjectives which change -y to -ier/-iest.
In the USER LEXICON, we can define various adjective classes and introduce some adjectives for test phrases.
a = ADJ---ADJ (use more, most)
ar = ADJ---ADR (-er, -est)
ae = ADJ---ADE (-e)
ad = ADJ---ADD (doubling consonant)
ay = ADJ---ADY (-y)
uchu ae little
alli a good
kusi n *s happiness *ve amuse *ay happy
The rules for the INFLECTION section are as follows (the symbol > means: double the last character, then append the
ending; the symbol < means: delete the last character, then append the ending):
ADR (cmp) -> -er
ADR (sup) -> -est
ADE (cmp) -> -r
ADE (sup) -> -st
ADD (cmp) -> >er
ADD (sup) -> >est
ADY (cmp) -> <-ier
ADY (sup) -> <-iest
ADY (adv) -> <-ily
AD* (adv) -> -ly
But when an adjective does not take an ending in the comparative and superlative, we have to recognize this
before inflection and delete the "cmp" or "sup" attribute after "more" or "most" has been generated. We introduce a
COMPLEX section rule that renames MOR/MOS to a simple adverb ADV; the generation of "cmp" and "sup" is then
suppressed automatically.
2 >
MOR
ADJ ~ADJ
symbol_ADV
U
Finally, there are irregular adjective comparisons and adverb formations, some of them are:
goodly -> well
gooder -> better
goodest -> best
littler -> smaller
littlest -> smallest
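The interplay of inflection classes and irregular forms can be sketched in Python as follows. This is not Nilix code: the function `inflect_adjective` is hypothetical, and the example adjectives "hot" and "intelligent" are my own additions for illustration. Note also that the sketch puts "good" into class ADR in one example so that the irregular table has something to correct; this is an assumption.

```python
# Irregular forms, as in the substitution list above.
IRREGULAR = {
    "goodly": "well",
    "gooder": "better",
    "goodest": "best",
    "littler": "smaller",
    "littlest": "smallest",
}


def inflect_adjective(word, cls, form):
    """Inflect an English adjective (hypothetical helper).

    cls:  "ADJ" (more/most), "ADR" (-er/-est), "ADE" (ends in -e),
          "ADD" (doubling consonant), "ADY" (ends in -y)
    form: "cmp", "sup" or "adv"
    """
    if form == "adv":
        if cls == "ADY":
            result = word[:-1] + "ily"      # <-ily
        else:
            result = word + "ly"            # -ly
    elif cls == "ADJ":
        result = ("more " if form == "cmp" else "most ") + word
    else:
        ending = "er" if form == "cmp" else "est"
        if cls == "ADR":
            result = word + ending              # -er, -est
        elif cls == "ADE":
            result = word + ending[1:]          # -r, -st
        elif cls == "ADD":
            result = word + word[-1] + ending   # >er, >est
        elif cls == "ADY":
            result = word[:-1] + "i" + ending   # <-ier, <-iest
        else:
            raise ValueError("unknown inflection class: " + cls)
    # Regular generation first, then the irregular substitution.
    return IRREGULAR.get(result, result)


print(inflect_adjective("little", "ADE", "sup"))
print(inflect_adjective("happy", "ADY", "cmp"))
```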
The translation of the first two example sentences would be:
Nilix translator 1.6 (2013/08/09 10:14:57)
Copyright © 1983-2013 Dr. med. O'Niel Som
www.nili.com * www.nili.de
kay allim runa qa huk uchum misi ta rikun.
The good man sees a little cat.
kay aswan allim runa qa kay lliw uchum misi ta rikuptin.
The better man would see the smallest cat.
Comparative and equative phrases
An equative phrase is of the type "A is B", where A is a noun or pronoun, and B is a noun or an adjective. The
Simple Translation Language uses "qa" for the element A and the case marker "mi" for the element B.
In English, there is no special case for this equative phrase type, so we can translate it as if it were an
accusative object or an adverb.
kay runa qa allim mi kan.
The man is good.
To make a comparison, we use the postposition "hina" (as):
kay runa qa kay warmi hina allim mi kan.
The man is as good as the woman.
The comparative uses the postposition "mantas" (than) instead.
kay runa qa kay warmi mantas aswan allim mi kan.
The man is better than the woman.
In the superlative, we use a genitive expression as in English.
kay runa qa tukum runakuna pa lliw allim mi kan.
The man is (the) best of all men.
Positive, comparative and superlative adjectives may be used with other verbs than "kan" (to be):
kay runa qa kay warmi mantas aswan utqatam purin.
The man goes faster than the woman.
First, we define the three case markers in the LEXICON section:
hina C04------ as
mi C05------ equative
mantas C06------ than
Then, we combine them with a noun phrase to form a CMP (comparative object) in a SYNTAX section:
SU*C05 -> CMP (001,nom)
SU*C06 -> CMP (002,001,acc)
We do not say "The man is good as the woman" but "The man is *as* good as
the woman", so in English, "as" comes before and after the adjective. We use the accusative marker because in
English, there is a tendency to say "as/than me" rather than "as/than I".
SU*C04AD. -> CMPAD. (002,001,acc+002,003)
SU*C04 -> CMP (002,001,acc)
ADJC05 -> ADV (001)
We must change the position of a comparison object and an adjective. When no adjective follows, we can treat the
comparison object as an adverb.
CMPAD* -> ADV (002,001)
CMP -> ADV (001)
Now, all the rules for comparative sentences are defined. Let us try to translate the rest of test-07.txt. As you
will see, the article "the" is not yet inserted automatically before the superlative "best":
Nilix translator 1.6 (2013/08/09 10:14:57)
Copyright © 1983-2013 Dr. med. O'Niel Som
www.nili.com * www.nili.de
kay runa qa allim mi kan.
The man is good.
kay runa qa kay warmi hina allim mi kan.
The man is as good as the woman.
kay runa qa tukum runakuna pa lliw allim mi kan.
The man is best of all men.
kay runa qa kay warmi mantas aswan allim mi kan.
The man is better than the woman.
kay runa qa kay warmi mantas aswan utqatam purin.
The man goes faster than the woman.
kay allim runa qa huk uchum misi ta rikun.
The good man sees a little cat.
kay aswan allim runa qa kay lliw uchum misi ta rikuptin.
The better man would see the smallest cat.
CHAPTER 8
rules-08.txt and test-08.txt
Postpositions and adverbial phrases
In the Simple Translation Language, there are no prepositions as in English, but postpositions. You have seen some
of them in the previous chapter ("hina", "mantas"). We will now define postpositions with locative, directional and
temporal meaning.
Looking at the USER LEXICON, you will find a section for postpositions:
p = POS------
We can enter all basic postpositions of rules 4 and 5 of the grammar into the USER LEXICON. Some of them end
in "-m" for temporal meaning, even if the English translation is sometimes the same. English compound prepositions
use the symbol §, which is replaced by a blank automatically.
awqa p against
chimpa p opposite
chawpi p between
chawpim p between
hanan p on
hawa p outside
kama p until
kamam p until
karu p far§from
lloqe p on§the§left§of
man p to
manta p from
mantam p since
naq p without
neq p around
nta p through
ntam p through
ñawpa p before
ñawpam p before
paña p on§the§right§of
paq p for
paqranti p instead§of
pi p in
pim p in
pura p among
qaylla p near
qayllam p at
rayku p because§of
siki p behind
sikim p after
uku p inside
ura p under
wan p with
All postpositions move to the beginning of the noun phrase, and they are used with the accusative case (which is
only important for pronouns). Postpositional phrases are treated as adverbs for their position in the sentence. This
can be achieved by only one rule:
S**POS -> ADV (002,001,acc)
If a postposition is used alone, without a noun phrase, we should change its symbol to ADV (adverb) and
append a neutral pronoun such as "it" or "that":
POS -> ADV (001,THT)
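What these two rules do to the word order can be shown with a short Python sketch. It is a simplified illustration, not Nilix code: the function `prepositional_phrase` is hypothetical, the lexicon is a small excerpt of the list above, and the pronoun table is my own addition to show the accusative case.

```python
# A few postpositions from the USER LEXICON above; § stands for a blank.
POSTPOSITIONS = {
    "pi": "in",
    "man": "to",
    "wan": "with",
    "karu": "far§from",
    "paqranti": "instead§of",
}

# Accusative forms matter only for pronouns (assumed small table).
ACCUSATIVE = {"I": "me", "he": "him", "she": "her", "we": "us", "they": "them"}


def prepositional_phrase(noun_phrase, postposition):
    """Move the postposition in front of the translated noun phrase.

    `noun_phrase` is a list of English words and may be empty.
    """
    preposition = POSTPOSITIONS[postposition].replace("§", " ")
    if not noun_phrase:
        # Postposition used alone: append the neutral pronoun "that".
        return preposition + " that"
    words = [ACCUSATIVE.get(word, word) for word in noun_phrase]
    return preposition + " " + " ".join(words)


print(prepositional_phrase(["the", "house"], "pi"))
print(prepositional_phrase([], "paqranti"))
```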
The THT will be substituted in the FUNCTIONS section to "that". The translation of two test phrases in test-08.txt
would be:
Nilix translator 1.6 (2013/08/09 10:14:57)
Copyright © 1983-2013 Dr. med. O'Niel Som
www.nili.com * www.nili.de
kay runa qa kay wasi pi kan.
The man is in the house.
warmikuna qa huk wasi man purirqan.
Women goed to a house.
Oops: "goed" is wrong, the past tense of "go" should be "went", so we must include this form in the IRREGULARITY
section.
goed -> went
CHAPTER 9
rules-09.txt and test-09.txt
Relative clauses and subordinate phrases
Whereas relative phrases in original Quechua are expressed uniquely by a participle which stands in front of the noun
(e.g. the loved friend = the friend who is loved, or the loving friend = the friend who loves), in the Simple
Translation Language, we find relative pronouns as in English.
The relative pronoun is "wak" and receives a case marker as every other noun phrase. I have decided to assign a SN
symbol to it, as the personal pronouns have.
All subordinate phrases end with "miki". Unlike in English, each relative clause is surrounded by commas.
kay runa qa, wak qa kay wasi pi kan miki, misikuna ta rikun.
The main phrase is "kay runa qa misikuna ta rikun" (the man sees cats), and the subordinate relative phrase is "wak
qa kay wasi pi kan miki" (which is in the house). To translate relative clauses, we need only define the appropriate
words in the LEXICON:
wak SN0------ which
miki PHE------ subordinate§end
Conjunctions can be included, e.g.:
c = CON------
hinapas c although
hinapti c if
hinaraq c before
hinarayku c because
hinaspa c so§that
hinasqa c after§that
hinasti c when
hinataq c that
Because relative phrases are recursive, I have tried a lot of possibilities and finally, I came to the conclusion
that the subject-verb congruence rules must be rewritten.
Consider an example:
subject-1 relative-clause [subject-2 verb-2] object-1 relative-clause [subject-3 verb-3] verb-1
Our approach so far searches for a subject (subject-1) and then takes the nearest following verb (here verb-2). So it
will generate a wrong subject-verb concord. Instead of searching the whole phrase for the next verbal form, we can
change the rules so that the subject and verb stand directly together before the concord rules apply.
The first step is to differentiate three types of subject by a symbol:
NO1 first person subject ("I")
NOS singular subject (third person)
NOP plural subject (including "you")
These symbols are generated at the stage of case markers in a SYNTAX section.
SUPC01 -> NOP (001,nom)
SUBC01 -> NOS (001,nom)
SN1C01 -> NO1 (001,nom)
S**C01 -> NOS (001,nom)
To simplify the following rules, I tried to treat adverbs, adjectives and accusative objects the same:
ACC -> ADV (001)
ADJ -> ADV (001)
A**A** -> ADV (001,002)
The congruence rules can be in a SYNTAX section, as no discontinuous rules are required. First, we change the
position of finite verb (FIN), adverb (ADV, including former accusative objects and adjectives), and infinitive
(INF).
INFFI. -> FI.INF (002+001)
ADVFI. -> FI.ADV (002+001)
ADVINF -> INFADV (002+001)
Now the subject should stand in front of the verb. The three subject types are then handled in a straightforward way:
NO1FIN -> NOMFIV (001+002,fir)
NOPFIN -> NOMFIV (001+002,plu)
NO*FIN -> NOMFIV (001+002,thi,sin)
Now all subjects should be marked as NOM, and all congruent verbs as FIV. The relative clauses have not yet been
embedded to the preceding noun phrase.
We rewrite the last SYNTAX section, just before the FUNCTION section indicates the end of syntactic transformation.
(1) change the position of adverb, finite verb and infinitive once again
ADVFININF -> FININFADV (002+003+001)
ADVINF -> ADV (002,001)
INFADV -> ADV (001,002)
ADVADV -> ADV (001,002)
SSSADV -> SSS (001,002)
(2) treat questions before declarative sentences
NOMFIVADVQQQ -> SSS (002,001,003)
NOMFIVQQQ -> SSS (002,001)
... QQQ -> ... (001)
(3) in simple sentences, change SOV to SVO
NOMADVFIV -> SSS (001,003,002)
NOMFIV -> SSS (001,002)
ADVSSS -> SSS (001,002)
(4) delete KOM (comma), PHE (subordinate phrase end), REL (relative marker) and attach a relative clause RSS to the preceding noun
KOMREL -> REL (002)
PHEKOM -> PHE (001)
RELSSSPHE -> RSS (002)
RELSSS -> RSS (002)
SSSPHE -> SSS (001)
...REL -> ... (001)
...RSS -> ... (001,002)
After attaching the relative clause to the preceding noun, the embedded structure disappears from the surface and
another syntax cycle is initiated. Then, the main subject and main verb are treated to form concordance. This
automatic repetition of all rules is an important feature for all embedded structures. The repetition terminates
only when no more changes are made.
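The repetition of rule cycles until nothing changes can be pictured as a fixed-point loop. The following Python sketch is not Nilix code; the three rules are a strongly simplified stand-in for the SYNTAX rules above, and the function names and the `max_cycles` guard are assumptions. It shows why adjacency matters: the inner subject and verb are joined first, the relative clause is attached to its noun, and only in the next cycle do the main subject and main verb stand together.

```python
def apply_rule(symbols, pattern, replacement):
    """Replace every non-overlapping occurrence of `pattern` (left to right)."""
    out = []
    changed = False
    i = 0
    n = len(pattern)
    while i < len(symbols):
        if symbols[i:i + n] == pattern:
            out.extend(replacement)
            i += n
            changed = True
        else:
            out.append(symbols[i])
            i += 1
    return out, changed


def run_cycles(symbols, rules, max_cycles=50):
    """Apply all rules in order, cycle after cycle, until nothing changes."""
    cycles = 0
    while cycles < max_cycles:
        cycles += 1
        any_change = False
        for pattern, replacement in rules:
            symbols, changed = apply_rule(symbols, pattern, replacement)
            any_change = any_change or changed
        if not any_change:
            break
    return symbols, cycles


# Simplified rules (assumed): subject + verb -> sentence, relative marker +
# sentence + phrase end -> relative clause, noun + relative clause -> noun.
RULES = [
    (["NOM", "FIV"], ["SSS"]),
    (["REL", "SSS", "PHE"], ["RSS"]),
    (["NOM", "RSS"], ["NOM"]),
]

# Main subject, embedded relative clause, main verb.
start = ["NOM", "REL", "NOM", "FIV", "PHE", "FIV"]
print(run_cycles(start, RULES))
```

The last cycle makes no change at all; that empty cycle is what tells the loop to stop.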
Now, finally, we can translate the sentences of test-09.txt correctly:
Nilix translator 1.6 (2013/08/11 11:26:50)
Copyright © 1983-2013 Dr. med. O'Niel Som
www.nili.com * www.nili.de
kay runakuna qa huk misi ta, wak qa kay wasi pi kan miki, mana rikunqan.
The men won't see a cat which is in the house.
kay runa qa misikuna ta, wakkuna qa kay wasi pi kan miki, mana rikun.
The man doesn't see cats which are in the house.
kay runa qa, wak qa kay wasi pi kan miki, misikuna ta rikun.
The man which is in the house sees cats.
hinapas kay runa qa allim kan miki, kay warmi qa allim mana kan.
Although the man is good the woman is not good.
CHAPTER 10
rules-10.txt and test-10.txt
Questions with interrogative pronouns
In chapter 6, we have considered yes/no questions only. In the Simple Translation Language, all interrogative
question words consist of 2 or more components. We define them as usual in the USER LEXICON:
i = INT------
is = INS------
- may pi
i where
- may man
i where
- may manta
i where§from
- ima pacha pi
i when
- ima pacha mantam
i since§when
- ima pacha kama
i until§when
- ima qa
is what
- piy qa
is who
- piy ta
i whom
- ima ta
i what
- ima rayku
i why
imam a which
I have chosen the abbreviation is = INS for subject pronouns, since they do not require a "to do" question. As in
Quechua, the interrogative word ends in "taq", which we treat as a normal termination and discard it later. The word
"imam" (which) comes before other nouns and is treated as an adjective.
taq TAQ------ interrogative§pronoun
The SYNTAX rules are simple:
INTTAQ -> INT (001)
INS -> NOS (001,nom)
ACCTAQ -> INT (001)
...TAQ -> ... (001)
The COMPLEX rule that introduces "do" in questions has to be revised. Before "do" is introduced, we discard the
question marker (QQQ) when a subject interrogative pronoun (INS) or "imam" (which) is present in the subject phrase.
2 +
NO* #INS
QQQ
unchanged_
delete_
2 +
NO* IMAM
QQQ
unchanged_
delete_
For object questions ("what can you see"), we must change the position of the constituents before subject-verb congruence, because the
question word comes in between.
NO.INTFI. -> INTNO.FI. (002+001+003)
The final order of the question would be "what you can see", so we change the position once again in the SYNTAX
section that corrects the subject-verb-object order in a phrase:
INTNOMFIV -> SSS (001,003,002)
I discovered that "rikurqan" was translated as "sees", whereas "saw" would be correct. The cause is that the verb
received the attributes "pas thi sin", which stand for "past tense, third person, singular", and that "thi,sin" is
evaluated before "pas". So we must change the rule order in the INFLECTION section:
VEE (pas) -> -d
VE* (pas) -> -ed
VES (thi,sin) -> -es
VE* (thi,sin) -> -s
Then, "rikurqan" translated to "seeed". Of course, "see" is an irregular verb, so we should include it in the
IRREGULARITIES section:
seeed -> saw
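Why the order of the rules matters can be tried out with a small Python sketch. It assumes that the first matching rule wins, which is how the fix above reads; the function `inflect_verb` is hypothetical, and the verb "love" in class VEE is my own example. The wildcard * of the rule notation is matched here with `fnmatchcase` from the Python standard library.

```python
from fnmatch import fnmatchcase

# Inflection rules in the corrected order: past tense before third person.
RULES = [
    ("VEE", {"pas"}, "d"),
    ("VE*", {"pas"}, "ed"),
    ("VES", {"thi", "sin"}, "es"),
    ("VE*", {"thi", "sin"}, "s"),
]

IRREGULAR = {"seeed": "saw"}


def inflect_verb(word, cls, attrs, rules=RULES):
    """Apply the first rule whose class pattern and attributes match.

    Assumption: only one rule fires, then irregular forms are substituted.
    """
    for pattern, required, ending in rules:
        if fnmatchcase(cls, pattern) and required <= attrs:
            word = word + ending
            break
    return IRREGULAR.get(word, word)


attrs = {"pas", "thi", "sin"}
print(inflect_verb("see", "VER", attrs))                          # corrected order
print(inflect_verb("see", "VER", attrs, RULES[2:] + RULES[:2]))   # old order
```

With the old order, the third person rule fires first and the past tense is lost, which is exactly the error described above.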
Now, our translation of the sample phrases succeeds:
Nilix translator 1.6 (2013/08/11 11:26:50)
Copyright © 1983-2013 Dr. med. O'Niel Som
www.nili.com * www.nili.de
qam qa ima ta taq rikuna atin chu?
What can you see?
may pi taq qam qa kay misi ta rikuna atin chu?
Where can you see the cat?
piy qa taq kay misi ta rikun chu?
Who sees the cat?
qam qa kay misi ta rikun chu?
Do you see the cat?
ima pacha pi taq qam qa kay misi ta rikun chu?
When do you see the cat?
imam runa qa kay misikuna ta rikurqan chu?
Which man saw the cats?
CHAPTER 11
rules-11.txt and test-11.txt
Participles
There are two participles in the Simple Translation Language:
rikusti seeing
rikusqa seen
Both are used only as adjectives, but not to form compound tenses or abbreviated relative sentences. The translation
is simple:
kay rikusti runa qa (the seeing man)
kay rikusqa runa qa (the seen man)
We define two items in the TERMINATION section:
-sqa E16 ppp V..
-sti E17 ppa V..
In the SYNTAX section which analyzes all terminations, we transform them into adjectives:
V**E16 -> ADJ (001,ppp)
V**E17 -> ADJ (001,ppa)
And in the INFLECTION section, we include some rules to generate the forms.
VEE (ppa) -> <-ing
VE* (ppa) -> -ing
VEE (ppp) -> -d
VE* (ppp) -> -ed
The symbol < before -ing means: delete the last letter (which is an "e") from the verb. What would happen if we tried
to form the ppp of an irregular verb such as "riku" (see)? It would generate the wrong form "seeed", which is changed by
one of our previous rules in the IRREGULARITY section to "saw". We need to create an inflectional class for
irregular verbs which have different forms for past tense and ppp, such as "see - saw - seen".
We could adopt the convention that the second letter of the inflectional class VER changes to "I" (for irregular)
riku VER---VIE see
and then, create a different rule for irregular verbs:
VIE (ppp) -> -n
VI* (ppp) -> -en
Finally, of course, we need to include such forms into the IRREGULARITY section, if they do not match exactly the
-n/-en ending.
The result of "see" + "-d" would be "seed" and this one is changed to "saw". But, wait a moment: "seed" is a correct
word with another meaning, so it should be better to declare
riku VER---VIR see
resulting in "seeed", which can always be converted to "saw".
How could we define the entry of "ranti" (buy)? Probably, what comes to mind first:
ranti VER---VIR buy
Giving the forms "buys, buying, buyed, buyen", the last two would be changed to "bought, bought". As you see, the
irregular forms for past tense and ppp are the same, so we change to
ranti VER---VER buy
and have to substitute in the IRREGULARITY section only one entry
buyed -> bought
If you look now in rules-11.txt, then you will find another entry for "buy" in the USER LEXICON because "buy" can be
used as noun, too:
ranti n *v buy *s buy
Reflexive pronouns
In the Simple Translation Language, there is only one reflexive pronoun "kiki". In original Quechua, this receives
a possessive ending, as in "kikinku" (their self = themselves). We have to generate an adequate reflexive pronoun
according to the subject of the phrase.
kiki SNR------ kiki
As we will see, forms like "myself, yourself" are really accusative objects. We wait, until the accusative object
is recognized by SYNTAX rules. In the first COMPLEX section, we can include our rules:
for the first person (NO1="I"):
2 >
NO1
A** KIKI
unchanged_
replace_1MYSELF=REF------myself
You may see a rule for all third person subjects in the singular (which are not "he" or "she"):
2 >
NOS
A** KIKI
unchanged_
replace_1ITSELF=REF------itself
The program generates "itself", which is not correct, when we have an animate subject such as "man" or "woman". In
these cases, we could mark every noun as "MAS" or "FEM" in the semantic section of each grammar string if it is a
human being:
runa SUBMASSUB man
warmi SUBFEMSUB woman
Then, we could expand a rule for masculine and feminine humans as:
2 >
NOS ^MAS
A** KIKI
unchanged_
replace_1HIMSELF=REF------himself
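The choice of the reflexive pronoun boils down to a small decision table. The following Python sketch only summarizes the rules shown above; the function `reflexive_pronoun` is hypothetical, and "herself" for FEM is my assumption by analogy with the MAS rule. Plural subjects (NOP) are not covered by the rules shown, so the sketch returns None for them.

```python
def reflexive_pronoun(subject_symbol, semantic_marker=None):
    """Pick the English reflexive pronoun for "kiki" (hypothetical helper).

    subject_symbol:  "NO1", "NOS" or "NOP"
    semantic_marker: "MAS", "FEM" or None
    """
    if subject_symbol == "NO1":
        return "myself"
    if subject_symbol == "NOS":
        if semantic_marker == "MAS":
            return "himself"
        if semantic_marker == "FEM":
            return "herself"   # assumed by analogy with the MAS rule
        return "itself"
    # Plural subjects are not handled by the rules shown in this chapter.
    return None


print(reflexive_pronoun("NOS", "MAS"))
```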
CHAPTER 12
rules-12.txt and test-12.txt
Unknown words
We can now expand the LEXICON and USER LEXICON to include all 300 entries of the Simple Translation Language and
voilà: the translation is almost ready.
The last thing I would like to show you is how to translate unknown words. No electronic translation program and
no dictionary in the world is complete. Almost always, there are words which cannot be translated by simply looking
up their meaning in the list. Think about names of persons, locations, chemical formulae and so on.
The Simple Translation Language uses Spanish vocabulary where no appropriate word is found. Many words in
English and in Spanish have the same origin (Latin, Greek or Arabic), sometimes with a slightly different ending or
spelling. To avoid 20000 almost identical entries of the form
nacion s nation
localizacion s localization
vision s vision
universidad s university
we should define a list of endings, which can be substituted by the program automatically. Remember that no accents
are used in the Simple Translation Language.
cion > tion SUB
ion > ion SUB
dad > ty SUB
Entries of this type are written into an UNKNOWN WORDS section (which must be included just before the first SYNTAX
section). Each entry consists of two lines:
DADES DAD E20
SUB---SUY#ty
DAD
SUB---SUY#ty
So "universidades" is transformed to "university", then the plural "universities" would be formed in the target
language by usual inflection rules.
The termination -dades is split into -dad and ending E10 (plural). In the second line, you find the grammar string.
The # means here: the translation is lower case; % would mean upper case (for names).
It is also possible to include ambiguous suffixes (here -al may be an adjective or noun).
AL
A/S*ADJ---ADJ#al*SUB---SUB#al
Often, verbs of the type -ar are transformed to English -ate, e.g. "communicate", "terminate", "congratulate".
The rules would include all possible inflectional forms, that is with tense markers. The longest entries come first:
ARQAN A E12
VER---VEE#ate
ANQAN A E11
VER---VEE#ate
APTIN A E13
VER---VEE#ate
AN A E10
VER---VEE#ate
In rules-12.txt, you will see some other transformations. Unknown words in test-12.txt are "universidad",
"inteligente", "diccionario", "comunicar". Nilix tags them automatically with a trailing _UU symbol, which
is of no interest here, so we can delete it in an IDIOMS section:
#_UU
E
Now, you could try to translate the two sample sentences. Of course, the conversion of unknown words is not always
perfect, e.g. "diccionary" should be "dictionary" and "comunicate" should be "communicate" and "inteligent" should
be written with double "l".
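The principle of the UNKNOWN WORDS section, longest ending first, can be sketched in Python. This is an illustration only, not Nilix code: the function `guess_unknown` and the table layout are assumptions, and the table contains just a few of the endings shown above.

```python
# (source ending, target ending, symbol, attribute); a few entries from above.
SUFFIX_RULES = [
    ("dades", "ty", "SUB", "plu"),
    ("cion", "tion", "SUB", None),
    ("ion", "ion", "SUB", None),
    ("dad", "ty", "SUB", None),
    ("arqan", "ate", "VER", "pas"),
    ("an", "ate", "VER", "pre"),
]


def guess_unknown(word):
    """Translate an unknown word by its ending; return None if nothing fits.

    The longest endings are tried first, as in the rule list above.
    """
    ordered = sorted(SUFFIX_RULES, key=lambda rule: len(rule[0]), reverse=True)
    for source, target, symbol, attribute in ordered:
        if word.endswith(source) and len(word) > len(source):
            stem = word[:-len(source)]
            return stem + target, symbol, attribute
    return None


print(guess_unknown("universidades"))
print(guess_unknown("comunicarqan"))
```

As in the text above, the result is the base form plus an attribute; the plural or past tense is then left to the usual inflection rules of the target language.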
Nilix translator 1.6 (2013/08/12 12:47:53)
Copyright © 1983-2013 Dr. med. O'Niel Som
www.nili.com * www.nili.de
kay universidades qa chaypi kan.
The universities are here.
inteligentem runakuna qa diccionario naq mana comunicarqan.
Inteligent men didn't comunicate without diccionary.
CHAPTER 13
rules-13.txt and test-13.txt
More natural grammar
For a Quechua speaker, the phrases presented so far may seem quite strange. I decided to make the constructed
Simple Translation Language more natural.
There are only four rules that will be changed.
1. verbs
So far, we have the verb only in the original third person singular form. The whole paradigm of personal inflection
is:
-ni I
-nki you (singular)
-n he/she/it
-niku we
-nchik we all, one
-nkichik you (plural)
-nku they
and so on for past tense, future tense and conditional mood. As the information contained in these suffixes is not
used (because an obligatory subject must be present in each phrase), we can substitute them all by a single ending.
COE
-ni E10 present
-nki E10 present
-n E10 present
-niku E10 present
-nchik E10 present
-nkichik E10 present
-nku E10 present
In the file rules-13.txt, you will see a lot of other possible forms for past, future and conditional tense.
Besides the ending itself, we should include these endings in the section for unknown words so that forms like
"conversarqaku" can be recognized as verbs in past tense.
2. adjectives
Adjectives receive the ending -m only in predicate use as "the tree is small". Whenever an adjective comes in front
of a substantive, it will have no ending like in "the small tree is here".
kay sacha qa uchum kan.
the tree is small.
kay uchu sacha qa kaypi kan.
the small tree is here.
This leads to an ambivalent use of words which can be used as nouns or adjectives. So we need disambiguation rules.
Polyvalent words have the syntactic symbol V/N.
V/N SUB -> ADJ SUB (001+002) > 001=ADJ
V/N V/N -> ADJ V/N (001+002) > 001=ADJ
V/N E3. -> ADJ E3. (001+002) > 001=ADJ
V/N E0. -> SUB E0. (001+002) > 001=SUB
V/N E1. -> VER E1. (001+002) > 001=VER
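These five rules all look at the symbol that follows the polyvalent word. A Python sketch of the same idea is given below; the function `disambiguate` is hypothetical, and the single-character wildcard "." of the rule notation is written as "?" because the sketch uses `fnmatchcase` from the Python standard library.

```python
from fnmatch import fnmatchcase

# (pattern for the following symbol, resolved part of speech)
DISAMBIGUATION = [
    ("SUB", "ADJ"),
    ("V/N", "ADJ"),
    ("E3?", "ADJ"),
    ("E0?", "SUB"),
    ("E1?", "VER"),
]


def disambiguate(symbols):
    """Resolve each V/N by the symbol that follows it (left to right)."""
    out = list(symbols)
    for i in range(len(out) - 1):
        if out[i] != "V/N":
            continue
        for pattern, resolved in DISAMBIGUATION:
            if fnmatchcase(out[i + 1], pattern):
                out[i] = resolved
                break
    return out


print(disambiguate(["V/N", "V/N", "SUB"]))
print(disambiguate(["V/N", "E10"]))
```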
3. articles
Quechua does not use the articles "huk" and "kay". In reality, they are the number "one" and the demonstrative pronoun
"this". I leave it to you to work out how the Simple Translation Language could be modified so that the
definite article "kay" is no longer used.
Consider the following cases, where English does not use an article:
• noun phrase in plural
• noun phrase with pronoun (e.g. "qam=you")
• noun phrase with proper name (e.g. person, geographic, language)
• noun phrase with number (including "huk" = "one")
• noun phrase with adjectives meaning "much, many" (e.g. "achka")
• noun phrase with possessive (e.g. "ñoqapa=my") or demonstrative pronoun (e.g. "chay=this")
We have to create a rule, which generates the definite English article whenever none of the cases mentioned above is
true.
1 +
SUB /#SN* /#NAM /#DE* /#REF /plu /#NUM /ACHKA
I0DEB------the
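The logic of this rule, insert "the" unless one of the listed conditions holds, can be written down in Python as follows. The feature names and both functions are assumptions made for this sketch; in Nilix, the conditions are the negated tests of the rule above.

```python
# Hypothetical feature names for the cases in which no article is used.
BLOCKING_FEATURES = (
    "plural",
    "pronoun",
    "proper_name",
    "number",
    "quantity_adjective",   # e.g. "achka" (much, many)
    "possessive",
    "demonstrative",
    "reflexive",
)


def needs_definite_article(features):
    """True if none of the blocking features is set for the noun phrase."""
    return not any(features.get(name, False) for name in BLOCKING_FEATURES)


def add_article(words, features):
    """Put "the" in front of the noun phrase when the rule applies."""
    if needs_definite_article(features):
        return ["the"] + list(words)
    return list(words)


print(add_article(["small", "tree"], {}))
print(add_article(["trees"], {"plural": True}))
```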
4. subject marker "qa" after personal pronoun
In Quechua, the "qa" is a topic marker, and may follow the subject or other parts of the sentence. In the Simple
Translation Language, "qa" is a necessary subject marker. Now, we will make it optional after personal pronouns.
After the rules which recognize C01 ("qa") as the subject marker, we could include this rule.
SN. -> NO. (001,nom)
Ok, now you can translate the file test-13.txt.
Epilogue
Congratulations! You have followed me writing a reasonably complete set of grammar rules for the automatic high
quality natural language translation using the software Nilix Translator and the artificial Simple Translation
Language. Now it is up to you to complete the rules and the dictionary and to create rules for other language
pairs.
Comments on this text are welcome (it will not be possible for me to answer all emails because I
work in a completely different area every day). Please copy this text, if you want, and distribute it to other
persons interested in language translation, but mention the source. My address is:
Dr. O'Niel Som
Am Lauersgraben 26c
96450 Coburg
Germany
Copyright © 2025 Dr. med. O'Niel Som Verlag · Goethestr. 7 · 68723 Plankstadt
Nili is a registered trademark of Dr. O'Niel Som.
All other trademarks mentioned belong to their respective owners and are used without marking.