Simple Translation Language for Nilix Translator

The automatic language translation kit

INTRODUCTION

simpletranslationlanguage.txt

In this text, I will show you how to use a rule-based translation program. The program is called Nilix translator and, as of 2013, has been under development for 30 years (the complete kit is available as a tar.gz archive for Linux and as a zip file for Windows). The main focus of this text is how to create a set of rules that achieves a fully automatic, high-quality translation. As natural languages are very complex, I want to introduce an easier language which I have constructed for this purpose.

The Simple Translation Language

The source language consists of only about 300 lexical items and 10 grammatical rules. The words and grammar have been adopted from Quechua, spoken in Peru, Bolivia, Ecuador and parts of Argentina and Colombia. Quechua is an agglutinative language, which in its original form is quite complicated. For example, a single verb can include a subject, an object and a determiner, as in the phrase "kuyaykim" (I love you!). The root "kuya" means "love", the transitional suffix "yki" means "I to you" and "m" is an assertive marker which emphasizes the personal experience.

Of course, such a complex language would not be adequate as a Simple Translation Language. Because of that, I simplified the language. The complete short description of the language, including grammar, pronunciation and basic vocabulary, can be found in the file "simpletranslationlanguage.txt".

The ten main grammar rules are:

01. The word order is genitive-demonstrative-number-adjective-noun-relative clause, negation-adverb-verb/adjective, conjunction-subject-object-infinitive-verb-marker, subordinate-main clause.
02. Questions append the marker "chu", subordinate sentences "miki", reported speech "si" and exclamations "ya".
03. The negation "mana" precedes the verb, adverb, adjective or noun. Generally, modifiers precede the modified expression.
04. Nouns and pronouns form the plural by "kuna" appended directly to the stem. Noun phrases end in "qa" (subject), "ta" (object), "mi" (equative), "ya" (vocative).
05. Postpositions are "pa" (of), "pi" (in), "man" (to), "manta" (from), "wan" (with), "paq" (for), "kama" (until), "nta" (through). Temporal postpositions add "-m".
06. Articles are "kay" (the) and "huk" (one/a). Numbers "iskay" (2), "kimsa" (3), "tawa" (4), "pichqa" (5), "soqta" (6), "qanchis" (7), "pusaq" (8), "isqon" (9), "chunka" (10), "pachak" (100), "waranqa" (1000), "eje" (0).
07. Noun compositions use a hyphen "-". Names are followed by a marker as "llaqta" (location), "runa" (human) or "qari" (man) or "warmi" (woman), "uywa" (animal), "sacha" (plant).
08. "piy" (who), "ima" (what), "may" (where) end in "taq" (interrogative), in "pas" (indefinite), without ending (relative pronoun). "tukuy" (every) and "mana" (no) may stand in front of them. "kiki" (self) is the reflexive pronoun.
09. Adjectives end in "m", derived adverbs in "tam". Comparisons use "aswan" (more), "lliw" (most) and the postposition "hina" (as), "mantas" (than).
10. Verb endings are "n" (present), "rqan" (past), "nqan" (future) and "ptin" (conditional), "y" (imperative), "na" (infinitive), "sqa" (ppp) and "sti" (ppa), "q" (actor).

The pronunciation rules are:

"q" as Scottish "loch", "ll" is "ly", "ñ" is "ny", "ch" as in "church", "r" (rolled) and vowels as in Spanish, i.e. "a" as in "father", "e" as in "bed", "i" as in "fit", "o" as in "for", "u" as in "put". Stress falls on the last but one syllable of the root, endings are not normally stressed.

The vocabulary has been reduced to about 300 words. All words not covered by this basic vocabulary or by combinations of them may be taken from Spanish. In this way, for a language enthusiast or amateur linguist, it will not be difficult to master the language at all. Words may change part of speech by different terminations.

Some differences between the Simple Translation Language and Quechua are:

1. verbs are not changed according to subject and object
2. the agglutination of suffixes is limited to one or two endings, most are written as independent words
3. possessive relations are not expressed by suffixes, but by pronouns
4. subject pronouns cannot be dropped; in Quechua, the subject is implicit in the verb ending
5. the subject is always marked by "qa" which is optional in Quechua
6. adjectives lose a possible end consonant, e.g. "hatun" (big) becomes "hatu", then "m" is appended
7. "kay" (this) and "huk" (one) are used as articles
8. Quechua has a much larger vocabulary than 300 words

CHAPTER 1

test-01.txt and rules-01.txt

How to invoke the Nilix translator

There are various ways to start a translation with Nilix translator. It is necessary to open a text terminal and to store the sentences in a text file without formatting code, for example "test-01.txt". You may invoke the program by the command line

   ./nilixtranslator test-01.txt rules-01.txt

The program uses the rules which are in one file called "rules-01.txt". They are prepared and stored in machine-readable format, without comments, in the directory "definitions" as several separate files.

The first examples

The output of the translation program will be something like:

   Nilix translator 1.6 (2013/08/07 09:32:32)
   Copyright © 1983-2013 Dr. med. O'Niel Som
   www.nili.com * www.nili.de
   
   ñoqa qa kay wasi ta rikun.
   I see the house. 
   
   kay runa pa kay wasi qa chaypi kan.
   The house of the man is here. 
   
   kay runakuna qa kay misi ta rikun.
   The men see the cat. 
   
   kay runa qa kay misikuna ta rikun.
   The man sees the cats. 
   
   añay ya.
   Thank you.

The words used in these example sentences are:

   nouns
   
   runa	man
   wasi	house
   misi	cat
   ñoqa	I
   
   verbs
   
   ka	to be
   riku	to see
   aña	to thank
   
   other
   
   kay	article (the)
   chaypi	here
   qa	subject
   ta	object
   ya	exclamation
   
   endings
   
   -kuna	plural
   -n	present tense
   -y	imperative

Original Quechua phrases would be:

runapa wasin chaypim.
runakunaqa misita rikunku.
runa misikunata rikun.
añaykim.
wasita rikuni.

The translation rules

The first translation rules are in the file "rules-01.txt". The main sections are explained here. Every section in the rules begins with a key word and terminates with a line containing only hyphens (---). The key words are:

   COMMENTS
   SUBSTITUTIONS
   LEXICON
   TERMINATIONS
   IDIOM
   SYNTAX
   COMPLEX
   FUNCTIONS
   INFLECTION
   IRREGULARITY
   FINAL SUBSTITUTIONS
   USER LEXICON

These section names may be singular or plural, but have to be UPPER CASE.

When you (or a script) call ./makenilix then these human readable rules are analysed and written to the directory "definitions" as various .nlx files as well as a main file "exec.nlx" which contains the order in which the rules are applied. Some of the sections may occur more than once: IDIOM, SYNTAX and COMPLEX.

The most important and interesting sections are LEXICON and USER LEXICON, SYNTAX, COMPLEX and INFLECTION which I will explain first.

Section LEXICON

Here are all grammatical word entries such as prepositions, auxiliary verbs and pronouns which will be treated in a special way. Adjectives, nouns, verbs and simple adverbs will be placed in the section USER LEXICON.

Nilix works internally with three grammatical categories for each word entry. Each category has three characters, so the complete grammar string consists of 9 characters. Every combination of letters is possible, but they should be easy to remember.

syntax or part of speech: SUB, VER, ADJ, ADV
semantics: HUM, MAS, FEM
inflection: VER, VBE, VHA

The choice of these three-letter-abbreviations is free. There are some syntax symbols with a fixed meaning:

   E** ending
   ADP verb prefix (in languages as German)
   ZZZ unknown word

Upper case words may receive one of the following symbols if stored into the lexicon:

   NAM name 
   TIT title
   LIN language
   GEO geographic name
   HUM human
   DAY day name

The rule writer (i.e. you) must use the symbols consistently. For example, I use VBE for inflection of the verb "to be" and VHA for the verb "to have". The whole grammatical definition is written as a string of 9 letters:

   VER---VER (which means syntax=VER, semantics=void, inflection=VER)

Abbreviations for commoner patterns may be defined as

   x = ADV------

Each lexicon line contains a word entry as source language, grammar string, target language, each separated by a blank:

   ñoqa SN1---PN1 I
   kay DET------ the
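
The layout of such a lexicon line can be illustrated with a short Python sketch. It is not part of Nilix: the function names `split_grammar` and `parse_lexicon_line` and the dictionary keys are assumptions made for this illustration, and the entry "chaypi x here" is a hypothetical use of the abbreviation defined above.

```python
def split_grammar(grammar):
    """Split a 9-character grammar string into its three 3-character fields."""
    if len(grammar) != 9:
        raise ValueError("grammar string must have exactly 9 characters")
    fields = [grammar[0:3], grammar[3:6], grammar[6:9]]
    # "---" marks an empty (void) field
    syntax, semantics, inflection = [None if f == "---" else f for f in fields]
    return {"syntax": syntax, "semantics": semantics, "inflection": inflection}


def parse_lexicon_line(line, abbreviations=None):
    """Parse 'source grammar target' and expand an abbreviated grammar string."""
    abbreviations = abbreviations or {}
    source, grammar, target = line.split(" ", 2)
    grammar = abbreviations.get(grammar, grammar)
    return source, split_grammar(grammar), target


print(split_grammar("VER---VER"))
print(parse_lexicon_line("ñoqa SN1---PN1 I"))
print(parse_lexicon_line("chaypi x here", {"x": "ADV------"}))
```

Splitting the line at most twice keeps a target entry that contains blanks in one piece.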

Section USER LEXICON

Here is the main part of the lexicon, mainly all adjectives, adverbs, nouns, and verbs. Each line contains an entry in the form: source language - grammar string - target language. Frequent grammar strings may be abbreviated; the abbreviation must be defined before it is used (similar to section LEXICON):

   v = VER---VER
   s = SUB---SUB
   
   runa s man
   misi s cat
   riku v see

Irregular forms in the source language may be replaced by regular entries. Let us assume we had English as source language, then we could create an entry for the irregular plural "children":

   child s wawa
   children>child s wawa

Idioms consisting of two words may be stored in the user lexicon. The first line contains the words of the source language, and the second line is the replacement in the target language. If the replacement consists of several words in the target language, the special symbol "§" must be used between them (it is converted to a blank later). The target language entry begins with a grammar string, which can be one of the abbreviations defined earlier.

   - qarim wawa
   s boy

This replaces the string "qarim wawa" by the substantive "boy", the grammar string is expanded from "s" to "SUB---SUB" according to the abbreviation introduced above. In the source line, "-" means the words follow one another immediately. As an alternative, "+" is used when there are possibly other words in between (such as article, negation etc.).

Section TERMINATIONS

For every inflected source language, there will be a lot of suffixes to strip from a word before lexical search. The longest terminations must come first.

   -nqan E11 future
   -rqan E12 past
   -ptin E13 conditional
   -n E10 present

A word in the source text ending in "-nqan" will be looked up in the lexicon without that termination, and if it is found, then two entries will be made: the first will be the verb, and the second will be the ending. Example: "rikunqan" is separated into "riku" (to see) and "nqan" (future ending). The resulting string table would be:

   RIKU VER---VER see
   NQAN E11------ future

From these entries, a syntax string is constructed, which will be the basis of the translation, in our example this would be VERE11.
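
The lookup described above can be sketched in a few lines of Python. This is only an illustration under assumptions, not the Nilix implementation: the names `analyse` and `syntax_string` are invented here, the lexicon holds a single entry, and the treatment of unknown words (tagging them ZZZ) is simplified.

```python
# Terminations, longest first, as in the TERMINATIONS section above.
TERMINATIONS = [
    ("nqan", "E11", "future"),
    ("rqan", "E12", "past"),
    ("ptin", "E13", "conditional"),
    ("n", "E10", "present"),
]

# Tiny lexicon for the example: stem -> (grammar string, target word).
LEXICON = {"riku": ("VER---VER", "see")}


def analyse(word):
    """Return table rows (SOURCE, grammar string, target) for one source word."""
    if word in LEXICON:
        grammar, target = LEXICON[word]
        return [(word.upper(), grammar, target)]
    for ending, symbol, meaning in TERMINATIONS:
        stem = word[: -len(ending)]
        if word.endswith(ending) and stem in LEXICON:
            grammar, target = LEXICON[stem]
            return [
                (stem.upper(), grammar, target),
                (ending.upper(), symbol + "------", meaning),
            ]
    return [(word.upper(), "ZZZ---ZZZ", word)]  # unknown word (simplified)


def syntax_string(rows):
    """Concatenate the first three letters of every grammar string."""
    return "".join(grammar[:3] for _, grammar, _ in rows)


rows = analyse("rikunqan")
for row in rows:
    print(*row)
print(syntax_string(rows))
```

Because the list is ordered longest first, "rikunqan" is matched by "-nqan" before the shorter "-n" is tried.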

Section SYNTAX

These are the main sections for the syntax analysis and transformation of a sentence. At the beginning of the syntax rules, the whole input sentence is stored in a two dimensional array like this:

   DEB	kay	the
   SUB	runa	man
   E01	qa	nominative
   DEB	kay	the
   SUB	misi	cat
   E02	ta	accusative
   VER	riku	see
   E10	present tense
   SEN	sentence end

One rule is written in one line.

   DEBSUB -> SUB (001,002)

This rule searches for an occurrence of DEB immediately followed by SUB and replaces them by the symbol SUB, which contains the elements in the order 001,002 (in this case, the order is not changed). After the application of this rule, the above sentence would be:

   SUB	DEB (kay-the) SUB (runa-man)
   E01	QA
   SUB	DEB (kay-the) SUB (misi-cat)
   E02	TA
   VER	riku	see
   E10	present tense
   SEN	sentence end

You can use the wildcard "*" (which matches any character) to reduce the number of rules and to combine similar rules on similar symbols:

   DE*SUB -> SUB (001,002)

This rule would match DEM (demonstrative), DEB (definite article), DEU (indefinite article) and so on. Sometimes, you want to use one, two or three characters of the left side also on the right side of the rule. Then you can use the wildcard ".", whose matched characters are carried over to the right side.

   SU.E10 -> SU. (001,plu)

This rule would substitute SUM E10 by SUM, and SUF E10 by SUF. On the right side, you see in parentheses the use of an auxiliary word, here "plu" for plural, which is inserted automatically. To analyse the case markers for subject and object phrases, we can apply rules like these:

   SU*E01 -> NOM (001,nom)
   SU*E02 -> ACC (001,acc)

To change the order of constituents, you may apply a rule of this type:

   NOMACCFIV -> SSS (001,003,002)

This would replace the syntax string "NOM" (nominative) "ACC" (accusative) and "FIV" (finite verb) by the symbol "SSS" (sentence), and the order would be changed from subject-object-verb to subject-verb-object. All rules are evaluated repeatedly, until no further changes are made. Repetition is important for recognizing embedded structures such as relative clauses. It is the responsibility of the rule writer to avoid circular substitutions, which would not terminate, as in:

   AAA -> BBB (001,xxx)
   BBB -> AAA (001,yyy)
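
The behaviour of such rules can be tried out with a minimal Python sketch. It is an illustration under stated assumptions, not the Nilix implementation: the function names are invented, each "." on the right side is assumed to be filled with the characters matched by "." on the left side in the same order, and the rule VERE10 -> FIV is a shortcut that skips the COMPLEX step which would normally produce FIV.

```python
def match_symbol(pattern, symbol):
    """Match one 3-letter symbol; '*' and '.' match any character.
    Returns the characters captured by '.', or None if there is no match."""
    captured = []
    for p, s in zip(pattern, symbol):
        if p == ".":
            captured.append(s)
        elif p != "*" and p != s:
            return None
    return captured


def apply_rule(table, left, right, order):
    """Apply one rule at the first position where it matches.
    table: list of (symbol, words); returns True if the table was changed."""
    patterns = [left[i:i + 3] for i in range(0, len(left), 3)]
    n = len(patterns)
    for start in range(len(table) - n + 1):
        captured = []
        for pattern, (symbol, _) in zip(patterns, table[start:start + n]):
            found = match_symbol(pattern, symbol)
            if found is None:
                break
            captured.extend(found)
        else:  # no break: every symbol of the pattern matched
            chars = iter(captured)
            new_symbol = "".join(next(chars) if c == "." else c for c in right)
            words = []
            for item in order:
                if item.isdigit():      # 001, 002, ... = matched entries
                    words.extend(table[start + int(item) - 1][1])
                else:                   # lower case = auxiliary attribute
                    words.append(item)
            table[start:start + n] = [(new_symbol, words)]
            return True
    return False


def transform(table, rules):
    """Re-evaluate all rules until none of them changes the table."""
    changed = True
    while changed:
        changed = any(apply_rule(table, *rule) for rule in rules)
    return table


RULES = [
    ("VERE10", "FIV", ["001", "pre"]),   # shortcut, see lead-in
    ("DE*SUB", "SUB", ["001", "002"]),
    ("SU*E01", "NOM", ["001", "nom"]),
    ("SU*E02", "ACC", ["001", "acc"]),
    ("NOMACCFIV", "SSS", ["001", "003", "002"]),
]

table = [
    ("DEB", ["kay"]), ("SUB", ["runa"]), ("E01", ["qa"]),
    ("DEB", ["kay"]), ("SUB", ["misi"]), ("E02", ["ta"]),
    ("VER", ["riku"]), ("E10", ["n"]), ("SEN", ["."]),
]

for symbol, words in transform(table, RULES):
    print(symbol, " ".join(words))
```

Every rule in this sketch shortens the table, so the loop is guaranteed to stop. A circular pair such as the AAA/BBB example would keep it running forever.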

Section COMPLEX

These rules involve entries which do not necessarily follow one another. We need such discontinuous rules, for example, for generating a verb form according to its subject, in other words: for subject-verb agreement. In English, we have to consider only three cases:

   subject I -> verb first person singular (only for "I am" and "I was")
   subject singular (he, she, it or singular noun phrase) -> verb third person singular
   every other subject -> verb plural

As an example, I will show here the rule for putting the attribute "fir" for first person to the verb phrase.

   2 +
   NOM #SN1
   FIN /fir
   unchanged_
   symbol_FIV insert_fir insert_sin

The start line means: the rule consists of two entries, which follow one another (not necessarily immediately).

The first item to find is an entry with the syntax string "NOM" and the attribute #SN1. The symbol "#" is used for the part of speech or syntax, which is the first three letters in the grammar string of every entry. The same symbol is used in the IDIOM section. So, the word "ñoqa", which means "I", is in the lexicon as "SN1---PN1I". The second item to find is an entry tagged FIN (finite verb) which does not yet have the attribute "fir". The slash means "not".

The last two lines of the rule are the actions. "unchanged_" means: do not change the line where NOM was found. The last line contains the three actions for the FIN found: "symbol_" changes the symbol from FIN to FIV, and each "insert_" appends a short attribute, here "fir" (first person) and "sin" (singular).

You can see the other similar rules to determine the concord of subject and verb for plural subjects and "you", for imperatives without subject and for the remaining subjects which must be third person singular in the file rules-01.txt in the COMPLEX section.

Remember that you may choose whichever abbreviation you like. So instead of "fir" you might have chosen "pe1" (person number one) or something similar. But then you would have to use this symbol consistently in all the other rules, too.
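
The first-person rule shown above can be paraphrased in Python. This is a sketch under assumptions, not how Nilix stores its tables: entries are modelled as dictionaries with the invented keys "symbol", "grammar" and "attrs", and only this one rule is hard-coded.

```python
def apply_first_person_rule(table):
    """Sketch of the rule: NOM #SN1 ... FIN /fir -> FIV with fir and sin."""
    for i, first in enumerate(table):
        # first item: symbol NOM whose part of speech (#) is SN1
        if first["symbol"] != "NOM" or first["grammar"][:3] != "SN1":
            continue
        # "+": the second item follows later, not necessarily immediately
        for second in table[i + 1:]:
            if second["symbol"] == "FIN" and "fir" not in second["attrs"]:
                # unchanged_: the NOM line is left as it is
                second["symbol"] = "FIV"              # symbol_FIV
                second["attrs"] += ["fir", "sin"]     # insert_fir insert_sin
                return True
    return False


table = [
    {"symbol": "STR", "grammar": "STR------", "attrs": []},
    {"symbol": "NOM", "grammar": "SN1---PN1", "attrs": ["nom"]},
    {"symbol": "ACC", "grammar": "SUB---SUB", "attrs": ["acc"]},
    {"symbol": "FIN", "grammar": "VER---VER", "attrs": ["pre"]},
    {"symbol": "SEN", "grammar": "SEN------", "attrs": []},
]

print(apply_first_person_rule(table))
print(table[3])
```

The test "/fir" keeps the rule from firing twice on the same verb, which matters because rules are evaluated repeatedly.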

Section INFLECTION

In this section, the actual forms of words are generated according to grammatical attributes. As an example, the following rule generates the word form "are" for the inflectional category VBE (verb to be) and the attribute "plu" (plural). The "=" means: substitute the whole entry.

   VBE (plu) -> =are

More often, endings are generated, using "-", as for the "s" of the third person singular or the noun plural:

   VE* (thi,sin) -> -s
   SU* (plu) -> -s

Section IRREGULARITY

After generation of the inflected forms, there are sometimes irregular words, which can be treated by simple substitution rules. For irregular verb forms of the verb "to have":

   haves -> has
   haved -> had

Or, irregular plural forms of substantives:

   mans -> men
   womans -> women

The remaining sections

Section COMMENTS

In this section, you will find comments for the human reader. They will be ignored by the program.

Section SUBSTITUTIONS

All characters are converted internally to upper case. The letters a-z are automatically transformed into A-Z. Some other substitutions might be vowels with accents or other diacritics:

   á -> A
   é -> E
   ñ -> NY

Section IDIOM

Idioms here are lexical entries which consist of several words. The rules may test not only words but also the part of speech or any other grammatical information. These rules are applied before syntax transformations begin. A rule consists of two lines. In the first line, we tell the program which items to find. The second line contains the actions to be taken. For example:

   #STR
   E

This rule means: search an entry tagged by the syntax (#) string STR (abbreviation for sentence-start), and if found erase that item (E). Another possible action is R (replace):

   #VDO
   R*VER---VDOdo

This second rule would search for the syntax symbol VDO (which might be used for verb-to do), and if found, replace it by an entry with grammar string VER---VDO and lexical entry in the target language "do".
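
The two actions can be sketched in Python as follows. This is an assumption-laden illustration: entries are modelled as strings of a 9-character grammar string followed by the target word, the function name `apply_idiom` is invented, and the reading of "R#" as "replace only the part of speech" is inferred from the verbose output of the translator, not from a specification.

```python
def apply_idiom(entries, test, action):
    """entries: strings made of a 9-character grammar string plus target word.
    test: '#XXX' tests the part of speech, anything else the target word."""
    result = []
    for entry in entries:
        grammar, word = entry[:9], entry[9:]
        hit = grammar[:3] == test[1:] if test.startswith("#") else word == test
        if hit and action == "E":
            continue                                  # erase the entry
        if hit and action.startswith("R*"):
            result.append(action[2:])                 # replace the whole entry
        elif hit and action.startswith("R#"):
            result.append(action[2:5] + entry[3:])    # replace part of speech
        else:
            result.append(entry)                      # no match: keep entry
    return result


entries = ["STR------(START)", "SN1---PN1I", "VDO------", "ZZZ---ZZZ."]
entries = apply_idiom(entries, "#STR", "E")
entries = apply_idiom(entries, "#VDO", "R*VER---VDOdo")
entries = apply_idiom(entries, "#ZZZ", "R#NAM")
print(entries)
```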

Section FUNCTIONS

Some function words are substituted at the end of syntax transformations by real words, for example:

   POF PRE of

The symbol POF is generated and inserted by the program when it encounters a syntax rule of this form:

   GENSUB -> SUB (001,POF,002)

POF appears in upper case on the right side of the rule to indicate a complete new word in the target language, different to all entries in lower case (which are short grammatical attributes).

Section FINAL SUBSTITUTIONS

This section contains, as the name suggests, some substitutions which are necessary after all rules have been applied. For example, the "§" is substituted by a blank, and a blank before a punctuation mark is removed.

   § ->  
    . -> .

Normally, there are no changes required for this section.

How a translation is performed

When we translate a sentence, all rules which match the structure of the phrase are applied sequentially and repeatedly until the syntax string has been reduced to SSS. Let us consider the simple phrase:

   ñoqa qa kay wasi ta rikun.

The final translation is "I see the house". The following output has been generated by the verbose (or visible) option of the Nilix translator.

   ########################################################
   
   ñoqa qa kay wasi ta rikun.
   version = Nilix translator 1.6 (2013/08/07 09:32:31)
   pathname = definitions/
   visible = TRUE
   filename = pass1
   resultname = pass2
   shellmode = TRUE
   MODUS=C
   -----------------------------------------------------------------------------
   *: visible = false
   -----------------------------------------------------------------------------
   idio1.nlx: 
   words in idioms at beginning of complexes (idiome.pas)
    1            ZZ_START STR------(START)
    2               NYOQA SN1---PN1I
    3                  QA C01------nominative
    4                 KAY DEB------the
    5                WASI SUB---SUBhouse
    6                  TA C02------accusative
    7                RIKU VER---VERsee
    8             present E10------present
    9                   . ZZZ---ZZZ.
   
   Idiom rule #ZZZ
   -> R#NAM
   
   1 STR------(START)
   2 SN1---PN1I
   3 C01------nominative
   4 DEB------the
   5 SUB---SUBhouse
   6 C02------accusative
   7 VER---VERsee
   8 E10------present
   9 NAM---ZZZ.
   
   Idiom rule .
   -> R*SEN------§.
   
   1 STR------(START)
   2 SN1---PN1I
   3 C01------nominative
   4 DEB------the
   5 SUB---SUBhouse
   6 C02------accusative
   7 VER---VERsee
   8 E10------present
   9 SEN------§.
   
   -----------------------------------------------------------------------------
   *: show original words
          ZZ_START STR------(START)
             NYOQA SN1---PN1I
                QA C01------nominative
               KAY DEB------the
              WASI SUB---SUBhouse
                TA C02------accusative
              RIKU VER---VERsee
           present E10------present
               §. SEN------§.
   
   -----------------------------------------------------------------------------
   *: syntax start
   -----------------------------------------------------------------------------
   synt1.nlx: 
   syntax rule SN* -> SUB
   order: 001
   found at 2
   
   STRSUBC01DEBSUBC02VERE10SEN
   
    1 STR: ZZ_START 
    2 SUB: NYOQA 
    3 C01: QA 
    4 DEB: KAY 
    5 SUB: WASI 
    6 C02: TA 
    7 VER: RIKU 
    8 E10: present 
    9 SEN: §. 
   
   syntax rule VERE10 -> FIN
   order: 001pre
   found at 7
   
   STRSUBC01DEBSUBC02FINSEN
   
    1 STR: ZZ_START 
    2 SUB: NYOQA 
    3 C01: QA 
    4 DEB: KAY 
    5 SUB: WASI 
    6 C02: TA 
    7 FIN: RIKU pre 
    8 SEN: §. 
   
   -----------------------------------------------------------------------------
   synt2.nlx: 
   syntax rule DE*S.. -> S..
   order: 001002
   found at 4
   
   STRSUBC01SUBC02FINSEN
   
    1 STR: ZZ_START 
    2 SUB: NYOQA 
    3 C01: QA 
    4 SUB: KAY WASI 
    5 C02: TA 
    6 FIN: RIKU pre 
    7 SEN: §. 
   
   syntax rule S**C01 -> NOM
   order: 001nom
   found at 2
   
   STRNOMSUBC02FINSEN
   
    1 STR: ZZ_START 
    2 NOM: NYOQA nom 
    3 SUB: KAY WASI 
    4 C02: TA 
    5 FIN: RIKU pre 
    6 SEN: §. 
   
   syntax rule S**C02 -> ACC
   order: 001acc
   found at 3
   
   STRNOMACCFINSEN
   
    1 STR: ZZ_START 
    2 NOM: NYOQA nom 
    3 ACC: KAY WASI acc 
    4 FIN: RIKU pre 
    5 SEN: §. 
   
   -----------------------------------------------------------------------------
   comp1.nlx: 
   
   STRNOMACCFINSEN
   
    1 STR: ZZ_START 
    2 NOM: NYOQA nom 
    3 ACC: KAY WASI acc 
    4 FIN: RIKU pre 
    5 SEN: §. 
   
   idiom rule:   NOM #SN1 ù FIN /fir
   substitution: U ù SFIV Ifir Isin
   
   After processing: 
   1. operation U  in line 2
   Line indexes: 2 4 
   STRNOMACCFINSEN
   
    1 STR: ZZ_START 
    2 NOM: NYOQA nom 
    3 ACC: KAY WASI acc 
    4 FIN: RIKU pre 
    5 SEN: §. 
   
   After processing: 
   2. operation S FIV in line 4
   Line indexes: 2 4 
   STRNOMACCFIVSEN
   
    1 STR: ZZ_START 
    2 NOM: NYOQA nom 
    3 ACC: KAY WASI acc 
    4 FIV: RIKU pre 
    5 SEN: §. 
   
   After processing: 
   2. operation I fir in line 4
   Line indexes: 2 4 
   STRNOMACCFIVSEN
   
    1 STR: ZZ_START 
    2 NOM: NYOQA nom 
    3 ACC: KAY WASI acc 
    4 FIV: RIKU pre fir 
    5 SEN: §. 
   
   After processing: 
   2. operation I sin in line 4
   Line indexes: 2 4 
   STRNOMACCFIVSEN
   
    1 STR: ZZ_START 
    2 NOM: NYOQA nom 
    3 ACC: KAY WASI acc 
    4 FIV: RIKU pre fir sin 
    5 SEN: §. 
   
   -----------------------------------------------------------------------------
   synt3.nlx: 
   syntax rule NOMACCFIV -> SSS
   order: 001003002
   found at 2
   
   STRSSSSEN
   
    1 STR: ZZ_START 
    2 SSS: NYOQA nom RIKU pre fir sin KAY WASI acc 
    3 SEN: §. 
   
   -----------------------------------------------------------------------------
   synt4.nlx: 
   syntax rule ...SSS -> SSS
   order: 001002
   found at 1
   
   SSSSEN
   
    1 SSS: ZZ_START NYOQA nom RIKU pre fir sin KAY WASI acc 
    2 SEN: §. 
   
   syntax rule SSS... -> SSS
   order: 001002
   found at 1
   
   SSS
   
    1 SSS: ZZ_START NYOQA nom RIKU pre fir sin KAY WASI acc §. 
   
   -----------------------------------------------------------------------------
   *: select first entry of polyvalent syntax; syntax end
   -----------------------------------------------------------------------------
   func.nlx: 
   Syntax end, table before interpreting rules (syntax.pas)
   
   SSS
   
    1 SSS: ZZ_START NYOQA nom RIKU pre fir sin KAY WASI acc §. 
   
   Store normal f=nom
   Store normal f=pre
   Store normal f=fir
   Store normal f=sin
   Store normal f=acc
   -----------------------------------------------------------------------------
   idio2.nlx: 
   words in idioms at beginning of complexes (idiome.pas)
    1            ZZ_START STR------(START)
    2               NYOQA SN1---PN1I
    3                 nom aux---auxnom
    4                RIKU VER---VERsee
    5         pre fir sin aux---auxpre fir sin
    6                 KAY DEB------the
    7                WASI SUB---SUBhouse
    8                 acc aux---auxacc
    9                 §. SEN------§.
   
   Idiom rule #STR
   -> E
   
   1 SN1---PN1I
   2 aux---auxnom
   3 VER---VERsee
   4 aux---auxpre fir sin
   5 DEB------the
   6 SUB---SUBhouse
   7 aux---auxacc
   8 SEN------§.
   
   -----------------------------------------------------------------------------
   flex.nlx: 
   Test words[1] SN1---PN1I
   Test words[2] aux---auxnom
   Test words[3] VER---VERsee
   Test words[4] aux---auxpre fir sin
   Test words[5] DEB------the
   Test words[6] SUB---SUBhouse
   Test words[7] aux---auxacc
   Test words[8] SEN------§.
   -----------------------------------------------------------------------------
   irrg.nlx: 
   words in idioms at beginning of complexes (idiome.pas)
    1               NYOQA SN1---PN1I
    2                 nom aux---auxnom
    3                RIKU VER---VERsee
    4         pre fir sin aux---auxpre fir sin
    5                 KAY DEB------the
    6                WASI SUB---SUBhouse
    7                 acc aux---auxacc
    8                 §. SEN------§.
   
   -----------------------------------------------------------------------------
   *: show with part of speech
             NYOQA SN1---PN1I
               nom aux---auxnom
              RIKU VER---VERsee
       pre fir sin aux---auxpre fir sin
               KAY DEB------the
              WASI SUB---SUBhouse
               acc aux---auxacc
               §. SEN------§.
   
   I see the house. 
   ########################################################

CHAPTER 2

test-02.txt and rules-02.txt

Multiple parts of speech

One characteristic of the Simple Translation Language is that many words may be used as various parts of speech. In English, there is a similar feature, e.g.

love = noun, verb
round = adjective, verb, substantive
back = adverb, substantive, verb

Let us define an abbreviation for the commonest type noun/verb in the USER LEXICON:

   s = SUB---SUB
   v = VER---VER
   n = V/N
   
   wayllu n *s love *v love

Then there should be a disambiguation procedure. The easiest way to achieve this is to use a standard SYNTAX section.

   V/NC.. -> SUBC.. (001+002) > 001=SUB
   V/NE1. -> VERE1. (001+002) > 001=VER

The first rule says: whenever a V/N is followed by a case marker (C01="qa" etc.), change its symbol to noun. In the second rule, the program will change every V/N to VER whenever a verb termination follows (E10="n"=present tense). If no disambiguation rule matches, then the first entry is chosen automatically at the end of the syntax transformation process. So be careful about the order in which you write the entry in the USER LEXICON:

   wayllu n *s love *v love

would be different from the following entry (where v=VER---VER is the default):

   wayllu n *v love *s love

If you look at the rules in the TERMINATIONS section, you will see that there is another possibility to disambiguate words directly:

   -NQAN E11 future V..
   -RQAN E12 past V..
   -PTIN E13 conditional V..
   -N E10 present V..

The V.. at the end of each line means: if this ending is found then treat the word as verb.
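
The selection between several readings can be sketched like this in Python. The function name `choose_reading` and the list representation of the lexicon entry are assumptions for illustration only.

```python
def choose_reading(readings, hint=None):
    """readings: list of (grammar string, target word) in lexicon order.
    hint: optional pattern such as 'V..' supplied by a termination."""
    if hint is not None:
        for grammar, target in readings:
            # '.' matches any character; only the first three letters count
            if all(h in (".", g) for h, g in zip(hint, grammar)):
                return grammar, target
    return readings[0]  # no hint or no match: the first entry is the default


# wayllu n *s love *v love
wayllu = [("SUB---SUB", "love"), ("VER---VER", "love")]

print(choose_reading(wayllu))          # no hint: noun reading
print(choose_reading(wayllu, "V.."))   # verb ending found: verb reading
```

Swapping the two readings in the list changes the default, which is why the order of "*s" and "*v" in the lexicon entry matters.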

Possessive pronouns

In the grammar rules for the Simple Translation Language, we have seen that the genitive case is formed by "pa". "wasi pa punku" means "door of the house" or "house door". The same genitive rule is applied to personal pronouns: "ñoqa pa wasi" means "my house". As the literal translation would be "house of me", we must invent a rule for English possessive pronouns.

The simplest way is to include two word lexical entries in the USER LEXICON. We could define an abbreviation "o" as part of speech for possessive pronouns.

   o = PPN------
   + ñoqa pa
   o my
   + qam pa
   o your

In the syntax rules, we have to include PPN in the recognition of noun phrases. In rules-01.txt, there were the following rules:

   ; build noun phrases
   
   ADJS.. -> S.. (001,002)
   DE*S.. -> S.. (001,002)
   NUMS.. -> S.. (001,002,plu)

Here, we would insert the line in this SYNTAX section:

   PPNS.. -> S.. (001,002)

But, as possessive pronouns in English are treated in the same way as demonstrative pronouns, we could use the part of speech DEM, so no new syntax rule would be required. Let us go back to the USER LEXICON and insert the following lines instead:

   d = DEM------
   + ñoqa pa
   d my
   + qam pa
   d your

With this definition, we can leave the SYNTAX section as it was before.

Various inflectional classes

When we now issue the command to translate our text file for this chapter, we get:

   Nilix translator 1.6 (2013/08/07 10:45:46)
   Copyright © 1983-2013 Dr. med. O'Niel Som
   www.nili.com * www.nili.de
   
   ñoqa qa qam ta wayllurqan.
   I loveed you. 
   
   ñoqa pa wayllu qa.
   My love. 

Of course, you will see an error in the past tense of love: the program has simply added "-ed", so the wrong form "loveed" is the result.

In English, verbs ending in "-e" are conjugated differently in the past tense and in the gerund form (dropping the final "-e" before "-ing"):

   love	love-s	love-d	lov-ing
   kiss	kiss-es	kiss-ed	kiss-ing
   seem	seem-s	seem-ed	seem-ing

In the inflectional part of the grammar string, we can define a different class for verbs ending in "-e" and also a class for verbs requiring "-es" in the present.

   v = VER---VER
   ve = VER---VEE
   vs = VER---VES

Next, the entry for "wayllu" would be changed to contain "ve" instead of "v". Then, we would construct different rules for our classes in the INFLECTION section:

   VES (thi,sin) -> -es
   VE* (thi,sin) -> -s
   VEE (pas) -> -d
   VE* (pas) -> -ed

To avoid unnecessary rules, the wildcard * is used after the specific rule to catch all remaining verbs.
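
Why the specific rule has to come before the wildcard rule can be seen in a short Python sketch. It assumes that the first matching rule wins. The function name `ending_for` is invented for this illustration.

```python
# The four INFLECTION rules from above, in the same order.
RULES = [
    ("VES", {"thi", "sin"}, "es"),
    ("VE*", {"thi", "sin"}, "s"),
    ("VEE", {"pas"}, "d"),
    ("VE*", {"pas"}, "ed"),
]


def ending_for(inflection_class, attrs):
    """Return the ending of the first rule that matches (assumed behaviour)."""
    for pattern, required, ending in RULES:
        same_class = all(p in ("*", c) for p, c in zip(pattern, inflection_class))
        if same_class and required <= set(attrs):
            return ending
    return ""


for verb, cls in [("love", "VEE"), ("kiss", "VES"), ("seem", "VER")]:
    print(verb + ending_for(cls, ["thi", "sin"]), verb + ending_for(cls, ["pas"]))
```

With the class VEE, "love" now receives "-d" instead of "-ed". If the wildcard rule stood first, it would also match VEE and the form "loveed" would come back.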

CHAPTER 3

rules-03.txt and test-03.txt

Translating infinitives

In the Simple Translation Language, modal/auxiliary verbs are combined directly with a preceding infinitive. In English, there are infinitives with "to" and infinitives without "to", for example:

   ñoqa qa purina munan.
   I want to go.
   
   ñoqa qa purina atin.
   I can go.

In the LEXICON, some modal verbs can be inserted, disregarding for the moment their possible use as nouns.

   muna VER---VER want
   ati VER---VER can
   tiya VER---VER must

It would be nice to mark the verbs which require a "to" infinitive. We can do this in the semantic field, which has not been used so far:

   muna VERTOIVER want

where TOI means: "infinitive with to". The rule to recognize an infinitive is by the termination E15 (-na) in the corresponding SYNTAX section.

   V**E15 -> INF (001,inf)

Let us now define a COMPLEX section which evaluates the combination INF FIN, an infinitive and a finite verb following one after another. In the items to recognize, we tell the program to compare only the semantic field, which is marked by ^.

   # test part of speech
   ^ test semantic
   ~ test inflection
   $ test the whole grammar string
   / negation

The rule can be written as follows, using the symbol PTO for the "preposition to", which must not be present yet:

   2 >
   INF /PTO
   FIN ^TOI
   insert_0PTO
   unchanged_

2 > means: there are two items which follow one another immediately (no other words may come in between). insert_0PTO means: insert the item PTO at position 0 (at the beginning). unchanged_ means: leave this line unchanged. Somewhere, the symbol PTO must be expanded to a full preposition "to"; the adequate place for this is the section FUNCTIONS:

   PTO PRE to

Finally, we must change positions of infinitive and finite verb in English by a SYNTAX rule to be inserted.

   INF FI. -> FI. (002,001)

We can now translate the first two sentences of test-03.txt correctly.
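
The interplay of the three steps (COMPLEX rule, reordering, FUNCTIONS) can be followed in this Python sketch. It is a simplification under assumptions: entries are dictionaries with the invented keys "symbol", "grammar" and "words", the test "/PTO" is read as "PTO has not been inserted yet", and only the verb phrase is handled.

```python
FUNCTIONS = {"PTO": "to"}   # PTO PRE to


def insert_to(table):
    """Sketch of: 2 > / INF /PTO / FIN ^TOI / insert_0PTO / unchanged_."""
    for inf, fin in zip(table, table[1:]):      # ">": immediately adjacent
        if (inf["symbol"] == "INF" and "PTO" not in inf["words"]
                and fin["symbol"] == "FIN" and fin["grammar"][3:6] == "TOI"):
            inf["words"].insert(0, "PTO")       # insert_0PTO
            return True                         # the FIN line stays unchanged
    return False


def translate_verb_phrase(infinitive, finite, finite_grammar):
    """Translate 'infinitive + finite verb' into English word order."""
    table = [
        {"symbol": "INF", "grammar": "VER---VER", "words": [infinitive]},
        {"symbol": "FIN", "grammar": finite_grammar, "words": [finite]},
    ]
    insert_to(table)
    # SYNTAX rule INF FI. -> FI. (002,001): the finite verb comes first
    words = table[1]["words"] + table[0]["words"]
    # FUNCTIONS: expand upper case symbols to real words
    return " ".join(FUNCTIONS.get(w, w) for w in words)


print(translate_verb_phrase("go", "want", "VERTOIVER"))
print(translate_verb_phrase("go", "can", "VER---VER"))
```

Only "want" carries TOI in its semantic field, so only its infinitive receives "to".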

CHAPTER 4

rules-04.txt and test-04.txt

Generating English verb tenses

So far, we considered only present and past tense because their translation is straightforward. Looking at the future tense, we will see that we must generate an auxiliary verb in English. Here are all tenses of the verb "riku":

   rikun     see
   rikurqan  saw
   rikunqan  will see
   rikuptin  would see
   rikusti   seeing
   rikusqa   seen
   rikuy     see!
   rikuna    to see

To translate the future and conditional tense, all we have to do is to insert an auxiliary in the right place, that is at the end of the verb phrase because the normal order of constituents is infinitive-finite verb.

   VERE11 -> INFFIV (001,inf+WIL)
   VERE13 -> INFFIV (001,inf+WLD)

Now, you can translate the example phrases in test-04.txt. The result will be:

   Nilix translator 1.6 (2013/08/07 10:45:46)
   Copyright © 1983-2013 Dr. med. O'Niel Som
   www.nili.com * www.nili.de
   
   ñoqa qa huk misi ta rikunqan.
   I will see a cat. 
   
   qam qa misikuna ta rikuptin.
   You would see cats.

CHAPTER 5

rules-05.txt and test-05.txt

Negative phrases

In the Simple Translation Language, there is only one negation "mana", which is placed before the verb phrase. In English, we must consider two cases:

(1) no auxiliary verb is present, then add "to do"
(2) an auxiliary verb is there, then use only "not"

Auxiliary verbs, which can be negated, should be easily distinguishable. So we choose a symbol for them:

   can
   could
   must
   will
   would
   do
   does
   did
   am
   is
   are
   was
   were
   may
   might
   shall
   should

A similar procedure will be used to determine whether an inverted word order is necessary in a question. Some of the forms in the list above will be generated by the program, such as WIL (will) and WLD (would). We should replace them with complete entries carrying a full grammar string. Doing this only at the end of syntax, in the FUNCTIONS section, would be too late.

One of various possibilities is the use of a semantic marker "AUX", placed in the middle of each grammar string just in the LEXICON section:

   ati VERAUXVCA can
   tiya VERAUX--- must
   ka VERAUXVBE be
   ruwa VERAUXVDO do

We will generate the auxiliaries directly during the verb ending analysis, with the temporary syntax symbol XXX to be substituted later:

   ; future and conditional with auxiliary
   
   VERE11 -> INFXXX (001,inf+WIL)
   VERE13 -> INFXXX (001,inf+WLD)

WIL and WLD are expanded to full entries in the next COMPLEX section:

   ; substitute WIL and WLD
   
   1 +
   XXX WIL
   symbol_FIN insert_0VERAUX---will erase_1
   
   1 +
   XXX WLD
   symbol_FIN insert_0VERAUX---would erase_1

Once all verb phrases have been analysed, but before subject-verb congruence is applied, only the symbols INF and FIN are in use. This is the moment to treat the negation. There are two cases:

a) an auxiliary is there (delete NEG, and insert it directly after the auxiliary verb, the position indicator is 1+)

   2 >
   NEG
   FI* ^AUX
   delete_
   insert_1+NEG------not

b) no auxiliary (change NEG to FIN, insert "do" at the beginning, and mark the full verb as infinitive)

   2 >
   NEG
   FI* /^AUX pre
   symbol_FIN insert_0VERAUXVDOdo insert_pre
   symbol_INF erase_2 insert_inf
   
   2 >
   NEG
   FI* /^AUX pas
   symbol_FIN insert_0VERAUXVDOdo insert_pas
   symbol_INF erase_2 insert_inf
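
The two negation cases can be summarized in a few lines of Python. This is a sketch of the idea only, not Nilix code: the function names are invented, person agreement ("does") is left out, and only the two tenses handled by the rules above ("pre", "pas") are covered.

```python
# Illustrative sketch only, not Nilix code: names are assumptions.

# Irregular forms as in the INFLECTION section (VDO, VCA)
IRREGULAR = {("do", "pas"): "did", ("can", "pas"): "could"}


def inflect(verb, tense):
    """Return the past form for known irregular verbs, else the base form."""
    return IRREGULAR.get((verb, tense), verb)


def negate(verb, tense, is_auxiliary):
    """English rendering of 'mana' + finite verb.

    verb: English base form; tense: "pre" or "pas";
    is_auxiliary: True if the lexicon entry carries the marker AUX.
    """
    if is_auxiliary:
        # case a): "not" goes directly after the auxiliary
        return f"{inflect(verb, tense)} not"
    # case b): "do" takes over the tense, the full verb becomes an infinitive
    return f"{inflect('do', tense)} not {verb}"


print(negate("see", "pre", False))  # do not see
print(negate("see", "pas", False))  # did not see
print(negate("can", "pas", True))   # could not
```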

We include the forms of "do" and "can" in the INFLECTION section to generate their irregular forms:

   VDO (pas) -> =did
   VDO (thi,sin) -> =does
   VDO (ppp) -> =done
   
   VCA (pas) -> =could

Finally, you will see that the negative of "can" is translated as "can not", which should be "cannot", so another rule is included as IDIOM before section FINAL SUBSTITUTIONS:

   can 0not
   E R*VER------cannot

0not means: zero other words may come in between
E = erase
R = replace

What is not implemented yet: the future or conditional of "ati" (can), which has to be translated as "will/would (not) be able to...". We can handle it by extending the last IDIOM rule with

   will 1can
   U R*VER------be§able§to
   
   would 1can
   U R*VER------be§able§to

1can means: one word may come in between (e.g. not).
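
To make the distance notation (0not, 1can) concrete, here is a small Python sketch of such an idiom matcher. It is not how Nilix is implemented; the tuple layout is invented, and "U" is assumed to mean "leave the first word unchanged", which the text above does not state explicitly.

```python
# Illustrative sketch only, not Nilix code. "U" is assumed to mean "unchanged".

IDIOMS = [
    # (first word, max. words in between, second word,
    #  action on first word, replacement for second word)
    ("can", 0, "not", "E", "cannot"),
    ("will", 1, "can", "U", "be able to"),
    ("would", 1, "can", "U", "be able to"),
]


def apply_idioms(words):
    """Apply every idiom rule once, scanning the sentence left to right."""
    words = list(words)
    for first, gap, second, action, replacement in IDIOMS:
        i = 0
        while i < len(words):
            matched = False
            if words[i] == first:
                # the second word may follow after at most `gap` other words
                for j in range(i + 1, min(i + gap + 2, len(words))):
                    if words[j] == second:
                        words[j] = replacement  # R = replace
                        matched = True
                        break
            if matched and action == "E":
                del words[i]  # E = erase the first word, do not advance
            else:
                i += 1
    return words


print(" ".join(apply_idioms(["cats", "can", "not", "see"])))
# cats cannot see
print(" ".join(apply_idioms(["cats", "will", "not", "can", "see"])))
# cats will not be able to see
```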

The final translation of test-05.txt is then:

   Nilix translator 1.6 (2013/08/07 10:45:46)
   Copyright © 1983-2013 Dr. med. O'Niel Som
   www.nili.com * www.nili.de
   
   qam qa mana rikunqan.
   You will not see. 
   
   ñoqa qa mana rikun.
   I do not see. 
   
   kay runa qa mana rikurqan.
   The man did not see. 
   
   huk misi qa mana rikuna atin.
   A cat cannot see. 
   
   misikuna qa mana rikuna atirqan.
   Cats could not see. 
   
   misikuna qa mana rikuna atinqan.
   Cats will not be able to see. 
   
   misikuna qa mana rikuna atiptin.
   Cats would not be able to see.

CHAPTER 6

test-06.txt and rules-06.txt

Questions

For the translation of questions, there are two cases to consider (similar to negative phrases):

(1) with auxiliary
(2) without auxiliary

both of them in positive and negative sentences.

All questions terminate in "chu" in the Simple Translation Language. The first thing is to include this particle into the section LEXICON. We will use the symbol QQQ for it.

   chu QQQ------ question

We must treat the pronoun "you" as plural and supply a rule for it before treating all pronouns and names as nouns in the SYNTAX section. I have seen this error, when I tried to translate a "Do you..." question.

   SN2 -> SUP (001,plu)
   NAM -> SUB (001)
   SN* -> SUB (001)

The introduction of "do" and copying of the tense marker from the main verb to the auxiliary, is the same process as for negative phrases. We define a temporary symbol "DDD" for the "do" which has not yet received the appropriate tense and rewrite the rule for negative sentences to match this.

   ; introduce "do" (DDD) negatives without auxiliary in present and past. 
   ; T1=exchange this line and line 1 because the normal order is infinitive - finite verb
   
   2 >
   NEG
   FIN /^AUX
   symbol_DDD insert_0DDDAUXVDOdo
   symbol_INF insert_inf exchange_1
   
   ; introduce "do" (DDD) in questions when no auxiliary is present
   
   3 >
   *** /&INF
   FIN
   QQQ
   unchanged_  
   symbol_INF insert_inf after_*DDDAUXVDOdo
   unchanged_
   
   ; copy the tense marker (pre=present, pas=past) from the infinitive to the inserted "do", DDD->FIN
   
   2 >
   INF pre 
   DDD
   erase_1
   symbol_FIN insert_pre
   
   2 >
   INF pas
   DDD
   erase_1
   symbol_FIN insert_pas

At this point, all questions should possess an auxiliary verb. Its position is changed to the beginning of the phrase in the SYNTAX section, at the same time QQQ (="chu") is dropped.

   NOM *** INF FIN QQQ -> SSS (004,001,003,002)
   NOM INF FIN QQQ -> SSS (003,001,002)
   ... QQQ -> ... (001)

If you like, you may expand the rules for contracted negative forms. R> means: replace only the target language item.

   can 0not
   E R>cannot
   
   could 0not
   E R>couldn't
   
   will 1can
   U R>be§able§to
   
   would 1can
   U R>be§able§to
   
   do 0not
   E R>don't
   
   did 0not
   E R>didn't
   
   will 0not
   E R>won't
   
   would 0not
   E R>wouldn't

Now, we can successfully translate all phrases from test-06.txt.

   Nilix translator 1.6 (2013/08/07 10:45:46)
   Copyright © 1983-2013 Dr. med. O'Niel Som
   www.nili.com * www.nili.de
   
   qam qa kay misi ta rikun chu?
   Do you see the cat? 
   
   qam qa kay misi ta rikuna atin chu?
   Can you see the cat? 
   
   qam qa mana rikunqan chu?
   Won't you see? 
   
   ñoqa qa mana rikun chu?
   Don't I see?

CHAPTER 7

test-07.txt and rules-07.txt

Adjective inflection

At the beginning, we want to disambiguate adjectives wherever various parts of speech are possible.

   V/NE3- -> ADJE3. (001+002) > 001=ADJ

Up to now, we have used adjectives only in the basic form. In English, there are 3 derivations:

(1) adverb, -ly
(2) comparative, -er or more
(3) superlative, -est or most

In the Simple Translation Language, adverbs derived from adjectives have the ending "tam" instead of "m", the comparative is formed by a preceding "aswan" and the superlative by a preceding "lliw". The object of comparison is marked by "hina" (as) or "mantas" (than), or, for the superlative, by a genitive as in English.

In the TERMINATIONS section, we add a line for adverbs:

   -tam E31 adverb
   -m E30 adjective

Some new entries go to the LEXICON section:

   aswan MOR------ more
   lliw MOS------ most

Let us analyse the adjective inflection first in a SYNTAX section.

   ADJE31 -> ADV (001,adv)
   MORAD. -> AD. (002,cmp)
   MOSAD. -> AD. (002,sup)

"aswan" and "lliw" might be used as adverbs receiving the ending "tam". In all other cases, when they are not used in front of an adjective, treat them as adjectives.

   MO.E31 -> ADV (001,adv)
   MO* -> ADJ (001)

To generate the inflectional forms, we have to consider two cases:

(1) comparative and superlative are formed by endings (-er, -est)
(2) by using "more" and "most"

As case (2) is much more frequent in English than case (1), we will mark only case (1) with a special inflection class. Within this class, we must separate adjectives ending in -e, adjectives which double their last consonant (as hot, hotter, hottest), and adjectives which change -y to -ier/-iest.

In the USER LEXICON, we can define various adjective classes and introduce some adjectives for test phrases.

   a = ADJ---ADJ (use more, most)
   ar = ADJ---ADR (-er, -est)
   ae = ADJ---ADE (-e)
   ad = ADJ---ADD (doubling consonant)
   ay = ADJ---ADY (-y)
   
   uchu ae little
   alli a good 
   kusi n *s happiness *ve amuse *ay happy

The rules for the INFLECTION section are (the symbol > means here: double the last character, then append an ending):

   ADR (cmp) -> -er
   ADR (sup) -> -est
   ADE (cmp) -> -r
   ADE (sup) -> -st
   ADD (cmp) -> >er
   ADD (sup) -> >est
   ADY (cmp) -> <-ier
   ADY (sup) -> <-iest
   ADY (adv) -> <-ily
   AD* (adv) -> -ly

But when an adjective does not take endings in the comparative and superlative, we have to recognize this before inflection and delete the "cmp" or "sup" attribute after "more" or "most" has been generated. We introduce a COMPLEX section rule that renames MOR/MOS to a simple adverb ADV; the generation of "cmp" and "sup" is then suppressed automatically.

   2 >
   MOR
   ADJ ~ADJ
   symbol_ADV
   U

Finally, there are irregular adjective comparisons and adverb formations, some of them are:

   goodly -> well
   gooder -> better
   goodest -> best
   littler -> smaller
   littlest -> smallest
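
The adjective classes, the INFLECTION rules and the irregular forms work together as a small pipeline: build the regular form, then look it up in the list of irregularities. The following Python sketch mimics this for the classes defined above. It is an illustration under assumptions, not Nilix code; the function name is invented, and the example words "hot" and "beautiful" are not taken from the rule files.

```python
# Illustrative sketch only, not Nilix code: function and table names are assumptions.

IRREGULAR = {
    "goodly": "well", "gooder": "better", "goodest": "best",
    "littler": "smaller", "littlest": "smallest",
}

ENDINGS = {"cmp": "er", "sup": "est"}


def inflect_adjective(word, cls, form):
    """cls: ADJ, ADR, ADE, ADD or ADY; form: "cmp", "sup" or "adv"."""
    if form == "adv":
        # ADY (adv) -> <-ily, AD* (adv) -> -ly
        result = word[:-1] + "ily" if cls == "ADY" else word + "ly"
    elif cls == "ADJ":
        # no ending: comparison with "more" / "most"
        result = ("more " if form == "cmp" else "most ") + word
    elif cls == "ADE":
        result = word + ENDINGS[form][1:]         # -r, -st
    elif cls == "ADD":
        result = word + word[-1] + ENDINGS[form]  # double the last character
    elif cls == "ADY":
        result = word[:-1] + "i" + ENDINGS[form]  # drop -y, add -ier / -iest
    else:  # ADR
        result = word + ENDINGS[form]
    # irregular forms replace the regularly generated ones
    return IRREGULAR.get(result, result)


print(inflect_adjective("little", "ADE", "sup"))     # smallest
print(inflect_adjective("hot", "ADD", "cmp"))        # hotter
print(inflect_adjective("happy", "ADY", "adv"))      # happily
print(inflect_adjective("beautiful", "ADJ", "cmp"))  # more beautiful
```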

The translation of the first two example sentences would be:

   Nilix translator 1.6 (2013/08/09 10:14:57)
   Copyright © 1983-2013 Dr. med. O'Niel Som
   www.nili.com * www.nili.de
   
   kay allim runa qa huk uchum misi ta rikun.
   The good man sees a little cat. 
   
   kay aswan allim runa qa kay lliw uchum misi ta rikuptin.
   The better man would see the smallest cat.

Comparative and equative phrases

An equative phrase is of the type "A is B", where A is a noun or pronoun, and B is a noun or an adjective. The Simple Translation Language uses "qa" for the element A and the case marker "mi" for the element B.

In English, there is no special case for this equative phrase type, so we can translate it as if it were an accusative object or an adverb.

   kay runa qa allim mi kan.
   The man is good.

To make a comparison, we use the postposition "hina" (as):

   kay runa qa kay warmi hina allim mi kan.
   The man is as good as the woman.

The comparative uses the postposition "mantas" (than) instead.

   kay runa qa kay warmi mantas aswan allim mi kan.
   The man is better than the woman.

In the superlative, we use a genitive expression as in English.

   kay runa qa tukum runakuna pa lliw allim mi kan.
   The man is (the) best of all men.

Positive, comparative and superlative adjectives may be used with other verbs than "kan" (to be):

   kay runa qa kay warmi mantas aswan utqatam purin.
   The man goes faster than the woman.

First, we define the three case markers in the LEXICON section:

   hina C04------ as
   mi C05------ equative
   mantas C06------ than

Then, we combine them with a noun phrase to form a CMP (comparative object) in a SYNTAX section:

   SU*C05 -> CMP (001,nom)
   SU*C06 -> CMP (002,001,acc)

We do not say "The man is good as the woman" but "The man is *as* good as the woman", so in English, "as" comes before and after the adjective. We use the accusative marker because in English, there is a tendency to say "as/than me" rather than "as/than I".

   SU*C04AD. -> CMPAD. (002,001,acc+002,003)
   SU*C04 -> CMP (002,001,acc)
   ADJC05 -> ADV (001)

We must change the position of a comparison object and an adjective. When no adjective follows, we can treat the comparison object as an adverb.

   CMPAD* -> ADV (002,001)
   CMP -> ADV (001)

Now, all the rules for comparative sentences are defined. Let us try to translate the rest of test-07.txt. As you will see, the article "the" is not yet inserted automatically before the superlative "best":

   Nilix translator 1.6 (2013/08/09 10:14:57)
   Copyright © 1983-2013 Dr. med. O'Niel Som
   www.nili.com * www.nili.de
   
   kay runa qa allim mi kan.
   The man is good. 
   
   kay runa qa kay warmi hina allim mi kan.
   The man is as good as the woman. 
   
   kay runa qa tukum runakuna pa lliw allim mi kan.
   The man is best of all men. 
   
   kay runa qa kay warmi mantas aswan allim mi kan.
   The man is better than the woman. 
   
   kay runa qa kay warmi mantas aswan utqatam purin.
   The man goes faster than the woman. 
   
   kay allim runa qa huk uchum misi ta rikun.
   The good man sees a little cat. 
   
   kay aswan allim runa qa kay lliw uchum misi ta rikuptin.
   The better man would see the smallest cat.

CHAPTER 8

rules-08.txt and test-08.txt

Postpositions and adverbial phrases

In the Simple Translation Language, there are no prepositions as in English, but postpositions. You have seen two of them in the previous chapter: "hina" and "mantas". We will now define postpositions with locative, directional and temporal meaning.

Looking at the USER LEXICON, you will find a section for postpositions:

   p = POS------

We can enter all basic postpositions of rules 4 and 5 of the grammar into the USER LEXICON. Some of them end in "-m" for temporal meaning, even if the English translation is sometimes the same. English compound prepositions use the symbol §, which will be replaced by a blank automatically.

   awqa p against
   chimpa p opposite
   chawpi p between
   chawpim p between
   hanan p on
   hawa p outside
   kama p until
   kamam p until
   karu p far§from
   lloqe p on§the§left§of
   man p to
   manta p from
   mantam p since
   naq p without
   neq p around
   nta p through
   ntam p through
   ñawpa p before
   ñawpam p before
   paña p on§the§right§of
   paq p for
   paqranti p instead§of
   pi p in
   pim p in
   pura p among
   qaylla p near
   qayllam p at
   rayku p because§of
   siki p behind
   sikim p after
   uku p inside
   ura p under
   wan p with

All postpositions move to the beginning of the noun phrase, and they are used with the accusative case (which is only important for pronouns). Postpositional phrases are treated as adverbs for their position in the sentence. This can be achieved by only one rule:

   S**POS -> ADV (002,001,acc)

In case a postposition is used alone, without a noun phrase, we should change its symbol to ADV (adverb) and append a neutral pronoun such as "it" or "that":

   POS -> ADV (001,THT)

The THT will be substituted in the FUNCTIONS section to "that". The translation of two test phrases in test-08.txt would be:

   Nilix translator 1.6 (2013/08/09 10:14:57)
   Copyright © 1983-2013 Dr. med. O'Niel Som
   www.nili.com * www.nili.de
   
   kay runa qa kay wasi pi kan.
   The man is in the house. 
   
   warmikuna qa huk wasi man purirqan.
   Women goed to a house.

Oops: "goed" is wrong, the past tense of "go" should be "went", so we must include this form in the IRREGULARITY section.

   goed -> went
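
The effect of the postposition rule and of the § symbol can be shown with a few lines of Python. This is only a sketch of the idea, not Nilix code; the function name is invented and the table contains just three of the lexicon entries listed above.

```python
# Illustrative sketch only, not Nilix code: names are assumptions.

# A few entries from the USER LEXICON above
POSTPOSITIONS = {"pi": "in", "man": "to", "karu": "far§from"}


def postposition_phrase(noun_phrase, postposition):
    """Mimic S**POS -> ADV (002,001,acc): the postposition moves to the front.

    noun_phrase is the already translated English noun phrase.
    """
    preposition = POSTPOSITIONS[postposition].replace("§", " ")
    return f"{preposition} {noun_phrase}"


print(postposition_phrase("the house", "pi"))    # in the house
print(postposition_phrase("the house", "karu"))  # far from the house
```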

CHAPTER 9

rules-09.txt and test-09.txt

Relative clauses and subordinate phrases

Whereas relative phrases in original Quechua are expressed uniquely by a participle which stands in front of the noun (e.g. the loved friend = the friend who is loved, or the loving friend = the friend who loves), in the Simple Translation Language, we find relative pronouns as in English.

The relative pronoun is "wak" and receives a case marker as every other noun phrase. I have decided to assign a SN symbol to it, as the personal pronouns have.

All subordinate phrases end with "miki". Different from English, each relative clause is surrounded by commas.

   kay runa qa, wak qa kay wasi pi kan miki, misikuna ta rikun.

The main phrase is "kay runa qa misikuna ta rikun" (the man sees cats), and the subordinate relative phrase is "wak qa kay wasi pi kan miki" (which is in the house). To translate relative clauses, we need only define the appropriate words in the LEXICON:

   wak SN0------ which
   miki PHE------ subordinate§end

Conjunctions can be included, e.g.:

   c = CON------
   
   hinapas c although
   hinapti c if
   hinaraq c before
   hinarayku c because
   hinaspa c so§that
   hinasqa c after§that
   hinasti c when
   hinataq c that

Because relative phrases are recursive, I have tried a lot of possibilities and finally, I came to the conclusion that the subject-verb congruence rules must be rewritten.

Consider an example:

   subject-1 relative-clause [subject-2 verb-2] object-1 relative-clause [subject-3 verb-3] verb-1

Our approach so far starts from a subject (subject-1) and then stops at the nearest following verb (here verb-2), so it generates a wrong subject-verb concord. Instead of searching the whole phrase for the next verbal form, we can change the rules so that subject and verb stand directly next to each other before the concord rules apply.

The first step is to differentiate three types of subject by a symbol:

   NO1 first person subject ("I")
   NOS singular subject (third person)
   NOP plural subject (including "you")

These symbols are generated at the stage of case markers in a SYNTAX section.

   SUPC01 -> NOP (001,nom)
   SUBC01 -> NOS (001,nom)
   SN1C01 -> NO1 (001,nom)
   S**C01 -> NOS (001,nom)

To simplify the following rules, I tried to treat adverbs, adjectives and accusative objects the same:

   ACC -> ADV (001)
   ADJ -> ADV (001)
   A**A** -> ADV (001,002)

The congruence rules can be in a SYNTAX section, as no discontinuous rules are required. First, we change the position of finite verb (FIN), adverb (ADV, including former accusative objects and adjectives), and infinitive (INF).

   INFFI. -> FI.INF (002+001)
   ADVFI. -> FI.ADV (002+001)
   ADVINF -> INFADV (002+001)

Now the subject should stand in front of the verb. The three subject types are treated straightforwardly:

   NO1FIN -> NOMFIV (001+002,fir)
   NOPFIN -> NOMFIV (001+002,plu)
   NO*FIN -> NOMFIV (001+002,thi,sin)

Now all subjects should be marked as NOM, and all congruent verbs as FIV. The relative clauses have not yet been embedded to the preceding noun phrase.

We rewrite the last SYNTAX section, just before the FUNCTION section indicates the end of syntactic transformation.

(1) change the position of adverb, finite verb and infinitive once again

   ADVFININF -> FININFADV (002+003+001)
   ADVINF -> ADV (002,001)
   INFADV -> ADV (001,002)
   ADVADV -> ADV (001,002)
   SSSADV -> SSS (001,002)

(2) treat questions before declarative sentences

   NOMFIVADVQQQ -> SSS (002,001,003)
   NOMFIVQQQ -> SSS (002,001)
   ... QQQ -> ... (001)

(3) in simple sentences, change SOV to SVO

   NOMADVFIV -> SSS (001,003,002)
   NOMFIV -> SSS (001,002)
   ADVSSS -> SSS (001,002)

(4) delete KOM (comma), PHE (subordinate phrase end), REL (relative marker) and attach a relative clause RSS to the preceding noun

   KOMREL -> REL (002)
   PHEKOM -> PHE (001)
   RELSSSPHE -> RSS (002)
   RELSSS -> RSS (002)
   SSSPHE -> SSS (001)
   ...REL -> ... (001)
   ...RSS -> ... (001,002)

After attaching the relative clause to the preceding noun, the embedded structure disappears from the surface and another syntax cycle is initiated. Then, the main subject and main verb are treated to form concordance. This automatic repetition of all rules is an important feature for all embedded structures. The repetition terminates only when no more changes are made.
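
The repetition of rules until nothing changes is a fixed-point iteration. The Python sketch below shows the principle on bare symbols (without the words attached to them). It is not Nilix code: the rule table is a simplified subset of the rules above, the rule NOM RSS -> NOM stands in for "...RSS -> ...", and the starting sequence is a hypothetical intermediate state of a sentence with one relative clause.

```python
# Illustrative sketch only, not Nilix code: rule subset and start sequence are assumptions.

RULES = [
    (("KOM", "REL"), "REL"),         # drop the comma before the relative marker
    (("PHE", "KOM"), "PHE"),         # drop the comma after the phrase end
    (("NOM", "ADV", "FIV"), "SSS"),  # simple sentence, SOV -> SVO
    (("NOM", "FIV"), "SSS"),
    (("REL", "SSS", "PHE"), "RSS"),  # a complete relative clause
    (("NOM", "RSS"), "NOM"),         # attach it to the preceding noun
]


def one_cycle(symbols):
    """Apply every rule once, each from left to right over the sequence."""
    for pattern, result in RULES:
        n = len(pattern)
        out = []
        i = 0
        while i < len(symbols):
            if tuple(symbols[i:i + n]) == pattern:
                out.append(result)
                i += n
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


def run_until_stable(symbols):
    """Repeat whole cycles until a cycle makes no change."""
    cycles = 0
    while True:
        new = one_cycle(symbols)
        if new == symbols:
            return symbols, cycles
        symbols, cycles = new, cycles + 1


start = ["NOM", "KOM", "REL", "NOM", "ADV", "FIV", "PHE", "KOM", "ADV", "FIV"]
print(run_until_stable(start))  # (['SSS'], 2)
```

In the first cycle only the embedded clause is reduced and attached to the main subject. The main clause NOM ADV FIV becomes visible afterwards and is reduced in the second cycle, which is why a single pass over the rules would not be enough.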

Now, finally, we can translate the sentences of test-09.txt correctly:

   Nilix translator 1.6 (2013/08/11 11:26:50)
   Copyright © 1983-2013 Dr. med. O'Niel Som
   www.nili.com * www.nili.de
   
   kay runakuna qa huk misi ta, wak qa kay wasi pi kan miki, mana rikunqan.
   The men won't see a cat which is in the house. 
   
   kay runa qa misikuna ta, wakkuna qa kay wasi pi kan miki, mana rikun.
   The man doesn't see cats which are in the house. 
   
   kay runa qa, wak qa kay wasi pi kan miki, misikuna ta rikun.
   The man which is in the house sees cats. 
   
   hinapas kay runa qa allim kan miki, kay warmi qa allim mana kan.
   Although the man is good the woman is not good.

CHAPTER 10

rules-10.txt and test-10.txt

Questions with interrogative pronouns

In chapter 6, we have considered yes/no questions only. In the Simple Translation Language, all interrogative question words consist of 2 or more components. We define them as usual in the USER LEXICON:

   i = INT------
   is = INS------
   
   - may pi
   i where
   
   - may man
   i where
   
   - may manta
   i where§from
   
   - ima pacha pi
   i when
   
   - ima pacha mantam
   i since§when
   
   - ima pacha kama
   i until§when
   
   - ima qa
   is what
   
   - piy qa
   is who
   
   - piy ta
   i whom
   
   - ima ta
   i what
   
   - ima rayku
   i why
   
   imam a which

I have chosen the abbreviation is = INS for subject pronouns, since they do not require a "to do" question. As in Quechua, the interrogative word ends in "taq", which we treat as a normal termination and discard it later. The word "imam" (which) comes before other nouns and is treated as an adjective.

   taq TAQ------ interrogative§pronoun

The SYNTAX rules are simple:

   INTTAQ -> INT (001)
   INS -> NOS (001,nom)
   ACCTAQ -> INT (001)
   ...TAQ -> ... (001)

The COMPLEX rule to introduce "do" in questions has to be reformed. Before introduction of "do", we disregard a question (QQQ) when a subject interrogative pronoun (INS) or "imam" (which) in the subject phrase is present.

   2 +
   NO* #INS
   QQQ
   unchanged_
   delete_
   
   2 +
   NO* IMAM
   QQQ
   unchanged_
   delete_

For object questions ("what can you see"), we must change the position of the constituents before subject-verb congruence, because the question word comes in between.

   NO.INTFI. -> INTNO.FI. (002+001+003)

The final order of the question would be "what you can see", so we change the position once again in the SYNTAX section that corrects subject-verb-object order in a phrase:

   INTNOMFIV -> SSS (001,003,002)

I have discovered that "rikurqan" was translated to "sees", but "saw" would be correct. The cause is, that the verb received the attributes "pas thi sin" which stand for "past tense, third person, singular" and that "thi,sin" is evaluated before "pas". So we must change that in the INFLECTION section:

   VEE (pas) -> -d
   VE* (pas) -> -ed
   VES (thi,sin) -> -es
   VE* (thi,sin) -> -s

Then, "rikurqan" translated to "seeed". Of course, "see" is an irregular verb, so we should include it in the IRREGULARITIES section:

   seeed -> saw
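
Why the order of the INFLECTION rules matters can be seen in a small Python model. It is a sketch under assumptions, not Nilix code: it assumes that only the first matching rule is applied, that * matches any single character of the inflection class, and that the irregular forms are substituted afterwards. The verb "love" with class VEE is an invented example.

```python
# Illustrative sketch only, not Nilix code: first-match behaviour is an assumption.

INFLECTION = [
    # (class pattern, required attributes, ending)
    ("VEE", {"pas"}, "d"),
    ("VE*", {"pas"}, "ed"),
    ("VES", {"thi", "sin"}, "es"),
    ("VE*", {"thi", "sin"}, "s"),
]

IRREGULARITIES = {"seeed": "saw", "goed": "went"}


def class_matches(pattern, cls):
    """'*' in the pattern matches any single character of the class."""
    return len(pattern) == len(cls) and all(
        p in ("*", c) for p, c in zip(pattern, cls)
    )


def inflect(word, cls, attributes):
    """Apply the first matching rule, then replace irregular forms."""
    for pattern, required, ending in INFLECTION:
        if class_matches(pattern, cls) and required <= attributes:
            word += ending
            break
    return IRREGULARITIES.get(word, word)


print(inflect("see", "VER", {"pas", "thi", "sin"}))  # saw
print(inflect("see", "VER", {"thi", "sin"}))         # sees
print(inflect("love", "VEE", {"pas"}))               # loved
```

With the two "pas" rules placed after the "thi,sin" rules, the first call would stop at "sees", which is the error described above.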

Now, our translation of the sample phrases succeeds:

   Nilix translator 1.6 (2013/08/11 11:26:50)
   Copyright © 1983-2013 Dr. med. O'Niel Som
   www.nili.com * www.nili.de
   
   qam qa ima ta taq rikuna atin chu?
   What can you see? 
   
   may pi taq qam qa kay misi ta rikuna atin chu?
   Where can you see the cat? 
   
   piy qa taq kay misi ta rikun chu?
   Who sees the cat? 
   
   qam qa kay misi ta rikun chu?
   Do you see the cat? 
   
   ima pacha pi taq qam qa kay misi ta rikun chu?
   When do you see the cat? 
   
   imam runa qa kay misikuna ta rikurqan chu?
   Which man saw the cats?

CHAPTER 11

rules-11.txt and test-11.txt

Participles

There are two participles in the Simple Translation Language:

   rikusti seeing
   rikusqa seen

Both are used only as adjectives, but not to form compound tenses or abbreviated relative sentences. The translation is simple:

   kay rikusti runa qa (the seeing man)
   kay rikusqa runa qa (the seen man)

We define two items in the TERMINATION section:

   -sqa E16 ppp V..
   -sti E17 ppa V..

In the SYNTAX section which analyzes all terminations, we transform them into adjectives:

   V**E16 -> ADJ (001,ppp)
   V**E17 -> ADJ (001,ppa)

And in the INFLECTION section, we include some rules to generate the forms.

   VEE (ppa) -> <-ing
   VE* (ppa) -> -ing
   VEE (ppp) -> -d
   VE* (ppp) -> -ed

The symbol < before -ing means: delete the last letter (which is an "e") from the verb. What would happen, if we try to form the ppp of an irregular verb as "riku" (see)? It would generate the wrong form "seeed" that is changed by one of our previous rules in the IRREGULARITY section to "saw". We need to create an inflectional class for irregular verbs which have a different form for past tense and ppp, as "see - saw - seen". We could invent that the second letter of the inflectional class VER changes to "I" (for irregular)

   riku VER---VIE see

and then, create a different rule for irregular verbs:

   VIE (ppp) -> -n
   VI* (ppp) -> -en

Finally, of course, we need to include such forms into the IRREGULARITY section, if they do not match exactly the -n/-en ending. The result of "see" + "-d" would be "seed" and this one is changed to "saw". But, wait a moment: "seed" is a correct word with another meaning, so it should be better to declare

   riku VER---VIR see

resulting in "seeed", which can always be converted to "saw". How could we define the entry of "ranti" (buy)? Probably, what comes to mind first:

   ranti VER---VIR buy

Giving the forms "buys, buying, buyed, buyen", the last two would be changed to "bought, bought". As you see, the irregular forms for past tense and ppp are the same, so we change to

   ranti VER---VER buy

and have to substitute in the IRREGULARITY section only one entry

   buyed -> bought

If you look now in rules-11.txt, then you will find another entry for "buy" in the USER LEXICON because "buy" can be used as noun, too:

   ranti n *v buy *s buy

Reflexive pronouns

In the Simple Translation Language, there is only one reflexive pronoun "kiki". In original Quechua, this receives a possessive ending, as in "kikinku" (their self = themselves). We have to generate an adequate reflexive pronoun according to the subject of the phrase.

   kiki SNR------ kiki

As we will see, forms like "myself, yourself" are really accusative objects. We wait, until the accusative object is recognized by SYNTAX rules. In the first COMPLEX section, we can include our rules:

for the first person (NO1="I"):

   2 >
   NO1
   A** KIKI
   unchanged_
   replace_1MYSELF=REF------myself

You may see a rule for all third person subjects in the singular (which are not "he" or "she"):

   2 >
   NOS
   A** KIKI
   unchanged_
   replace_1ITSELF=REF------itself

The program generates "itself", which is not correct, when we have an animate subject such as "man" or "woman". In these cases, we could mark every noun as "MAS" or "FEM" in the semantic section of each grammar string if it is a human being:

   runa SUBMASSUB man
   warmi SUBFEMSUB woman

Then, we could expand a rule for masculine and feminine humans as:

   2 >
   NOS ^MAS
   A** KIKI
   unchanged_
   replace_1HIMSELF=REF------himself

CHAPTER 12

rules-12.txt and test-12.txt

Unknown words

We can now expand the LEXICON and USER LEXICON to include all 300 entries of the Simple Translation Language and voilà: the translation is almost ready.

The last thing I would like to show you is how to translate unknown words. No electronic translation program and no dictionary in the world is complete. Almost always, there are words which cannot be translated by simply looking up their meaning in the list. Think about names of persons, locations, chemical formulae and so on.

The Simple Translation Language uses the Spanish vocabulary, where no appropriate words are found. Many words in English and in Spanish have the same origin (Latin, Greek or Arabic), sometimes with a slightly different ending or spelling. To avoid 20000 almost identical entries of the form

   nacion s nation
   localizacion s localization
   vision s vision
   universidad s university

we should define a list of endings, which can be substituted by the program automatically. Remember that no accents are used in the Simple Translation Language.

   cion > tion SUB
   ion > ion SUB
   dad > ty SUB

Entries of this type are written into an UNKNOWN WORDS section (which must be included just before the first SYNTAX section). Each entry consists of two lines:

   DADES DAD E20
   SUB---SUY#ty
   DAD
   SUB---SUY#ty

So "universidades" is transformed to "university", then the plural "universities" would be formed in the target language by usual inflection rules.

The termination -dades is split into -dad and the ending E20 (plural). In the second line, you find the grammar string. The # means here: the translation is lower case; % would mean upper case (for names).

It is also possible to include ambiguous suffixes (here -al may be an adjective or noun).

   AL
   A/S*ADJ---ADJ#al*SUB---SUB#al

Often, verbs of the type -ar are transformed to English -ate, e.g. "communicate", "terminate", "congratulate". The rules would include all possible inflectional forms, that is with tense markers. The longest entries come first:

   ARQAN A E12
   VER---VEE#ate
   ANQAN A E11
   VER---VEE#ate
   APTIN A E13
   VER---VEE#ate
   AN A E10
   VER---VEE#ate
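
The reason for putting the longest entries first is that a short suffix such as -an would otherwise match before -arqan gets a chance. The following Python sketch shows the lookup. It is not Nilix code; the table layout and the function name are assumptions, and None marks entries for which no ending symbol is given above.

```python
# Illustrative sketch only, not Nilix code: table layout and names are assumptions.

SUFFIXES = [
    # (source suffix, English replacement, ending symbol or None)
    ("dades", "ty", "E20"),
    ("dad", "ty", None),
    ("cion", "tion", None),
    ("arqan", "ate", "E12"),
    ("anqan", "ate", "E11"),
    ("aptin", "ate", "E13"),
    ("an", "ate", "E10"),
]

# Longest suffixes first, so that "arqan" is tried before "an"
BY_LENGTH = sorted(SUFFIXES, key=lambda entry: len(entry[0]), reverse=True)


def guess_unknown(word):
    """Return (English base form, ending symbol), or None if no suffix fits."""
    for suffix, replacement, ending in BY_LENGTH:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement, ending
    return None


print(guess_unknown("universidades"))  # ('university', 'E20')
print(guess_unknown("comunicarqan"))   # ('comunicate', 'E12')
print(guess_unknown("nacion"))         # ('nation', None)
```

The ending symbol returned here corresponds to the information that the normal inflection rules use afterwards to generate the plural or the tense in English.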

In rules-12.txt, you will see some other transformations. Unknown words in test-12.txt are "universidad", "inteligente", "diccionario", "comunicar". Nilix tags them automatically with a following _UU symbol, which is of no interest here, so we can delete it in an IDIOMS section

   #_UU
   E

Now, you could try to translate the two sample sentences. Of course, the conversion of unknown words is not always perfect, e.g. "diccionary" should be "dictionary" and "comunicate" should be "communicate" and "inteligent" should be written with double "l".

   Nilix translator 1.6 (2013/08/12 12:47:53)
   Copyright © 1983-2013 Dr. med. O'Niel Som
   www.nili.com * www.nili.de
   
   kay universidades qa chaypi kan.
   The universities are here. 
   
   inteligentem runakuna qa diccionario naq mana comunicarqan.
   Inteligent men didn't comunicate without diccionary.

CHAPTER 13

rules-13.txt and test-13.txt

More natural grammar

For a Quechua speaker, the phrases presented so far may seem quite strange. I decided to make the constructed Simple Translation Language more natural.

There are only four rules that will be changed.

1. verbs

So far, we have the verb only in the original third person singular form. The whole paradigm of personal inflection is:

   -ni      I
   -nki     you (singular)
   -n       he/she/it
   -niku    we
   -nchik   we all, one
   -nkichik you (plural)
   -nku     they

and so on for past tense, future tense and conditional mood. As the information contained in these suffixes is not used (because an obligatory subject must be present in each phrase), we can substitute them all by a single ending.

   COE
   -ni E10 present
   -nki E10 present
   -n E10 present
   -niku E10 present
   -nchik E10 present
   -nkichik E10 present
   -nku E10 present

In the file rules-13.txt, you will see a lot of other possible forms for past, future and conditional tense. Besides the ending itself, we should include these endings in the section for unknown words so that forms like "conversarqaku" can be recognized as verbs in past tense.

2. adjectives

Adjectives receive the ending -m only in predicate use as "the tree is small". Whenever an adjective comes in front of a substantive, it will have no ending like in "the small tree is here".

   kay sacha qa uchum kan.
   the tree is small.
   
   kay uchu sacha qa kaypi kan.
   the small tree is here.

This makes words that can be used as nouns or adjectives ambiguous, so we need disambiguation rules. Such polyvalent words have the syntactic symbol V/N.

   V/N SUB -> ADJ SUB (001+002) > 001=ADJ
   
   V/N V/N -> ADJ V/N (001+002) > 001=ADJ
   
   V/N E3. -> ADJ E3. (001+002) > 001=ADJ
   
   V/N E0. -> SUB E0. (001+002) > 001=SUB
   
   V/N E1. -> VER E1. (001+002) > 001=VER
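To make the effect of these five rules easier to follow, here is a minimal Python sketch of the same look-ahead logic. It is only an illustration, not the Nilix implementation. It assumes that the trailing "." in symbols such as "E3." is a wildcard for one character, so that "E3." matches any tag beginning with "E3"; the concrete tags "E01" and "E30" in the usage lines are hypothetical, while "E10" is the present-tense ending symbol used in the verb section.

```python
# (tag prefix of the following word, new tag for the V/N word), in rule order.
RULES = [
    ("SUB", "ADJ"),  # V/N SUB -> ADJ SUB
    ("V/N", "ADJ"),  # V/N V/N -> ADJ V/N
    ("E3", "ADJ"),   # V/N E3. -> ADJ E3.
    ("E0", "SUB"),   # V/N E0. -> SUB E0.
    ("E1", "VER"),   # V/N E1. -> VER E1.
]


def disambiguate(tags):
    """Resolve each V/N tag by looking at the tag of the next word.
    A V/N tag with no matching successor is left unchanged."""
    result = list(tags)
    for i in range(len(result) - 1):
        if result[i] != "V/N":
            continue
        for next_prefix, new_tag in RULES:
            if result[i + 1].startswith(next_prefix):
                result[i] = new_tag
                break
    return result


print(disambiguate(["V/N", "SUB"]))         # ['ADJ', 'SUB']
print(disambiguate(["V/N", "V/N", "SUB"]))  # ['ADJ', 'ADJ', 'SUB']
print(disambiguate(["V/N", "E10"]))         # ['VER', 'E10']
```

Because the list is scanned from left to right, a chain of polyvalent words in front of a substantive is resolved to adjectives one after the other.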

3. articles

Quechua does not actually use "huk" and "kay" as articles. In reality, they are the number "one" and the demonstrative pronoun "this". I leave it to you to work out how we could modify the Simple Translation Language so that the definite article "kay" is no longer used.

Consider the following cases, where English does not use an article:
• noun phrase in plural
• noun phrase with pronoun (e.g. "qam=you")
• noun phrase with proper name (e.g. person, geographic, language)
• noun phrase with number (including "huk" = "one")
• noun phrase with adjectives meaning "much, many" (e.g. "achka")
• noun phrase with possessive (e.g. "ñoqapa=my") or demonstrative pronoun (e.g. "chay=this")

We have to create a rule which generates the definite English article whenever none of the cases mentioned above applies.

   1 +
   SUB /#SN* /#NAM /#DE* /#REF /plu /#NUM /ACHKA
   I0DEB------the
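The decision behind this rule can be sketched in a few lines of Python. This is a rough illustration under two assumptions: that each "/" in the rule means "this feature must not be present", and that "*" is a wildcard for the rest of the symbol. The function name `needs_definite_article` and the feature values "#DE1" and "sin" in the usage lines are invented for the example; only the seven patterns are taken from the rule.

```python
from fnmatch import fnmatchcase

# Patterns copied from the rule above; a noun phrase carrying any of them
# does not receive the generated article.
BLOCKING_PATTERNS = ["#SN*", "#NAM", "#DE*", "#REF", "plu", "#NUM", "ACHKA"]


def needs_definite_article(features):
    """Return True if none of the features of the noun phrase matches
    a blocking pattern, i.e. "the" should be generated."""
    return not any(
        fnmatchcase(feature, pattern)
        for feature in features
        for pattern in BLOCKING_PATTERNS
    )


print(needs_definite_article({"sin"}))   # True
print(needs_definite_article({"plu"}))   # False
print(needs_definite_article({"#DE1"}))  # False
```

The rule is written as a list of exclusions because the cases without an article are easier to enumerate than the cases with one.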

4. subject marker "qa" after personal pronoun

In Quechua, "qa" is a topic marker and may follow the subject or other parts of the sentence. In the Simple Translation Language, "qa" is an obligatory subject marker. Now we will make it optional after personal pronouns. After the rules which recognize C01 ("qa") as the subject marker, we could include this rule:

   SN. -> NO. (001,nom)

OK, now you can translate the file test-13.txt.

Epilogue

Congratulations! You have followed me in writing a reasonably complete set of grammar rules for fully automatic, high-quality natural language translation using the software Nilix Translator and the artificial Simple Translation Language. Now it is up to you to complete the rules and the dictionary and to create rules for other language pairs.

Comments on this text are welcome (it will not be possible for me to answer all emails, because I work in a completely different field every day). Please feel free to copy this text and distribute it to other people interested in language translation, but mention the source. My address is:

Dr. O'Niel Som
Am Lauersgraben 26c
96450 Coburg
Germany

Nili is a registered trademark of Dr. O'Niel Som. All other trademarks mentioned belong to their respective owners and are used without special marking.