The Linux Cyrillic HOWTO: Editing text

4. Editing text

In this section I'll describe how to customize various text editors to work with Cyrillic text. It doesn't cover word processors, which are described later (see the section on word processing).

4.1 Emacs and XEmacs

There are two versions of the Emacs editor - GNU Emacs and XEmacs. While they provide more or less the same functionality, some implementation details differ significantly. Cyrillic setup requires some low-level (in the Emacs Lisp sense) tweaking, and it differs a bit between the two versions.

NOTE: Apart from the setup described here, there is an alternative way to configure both versions of Emacs: using MULE (MULtilingual Enhancement to GNU Emacs). That approach is fairly complicated and (to the best of my knowledge) rarely used, so I don't discuss it here.

For minimal Cyrillic support in GNU Emacs, add the following calls to your .emacs. This assumes that Cyrillic character set support is already installed for the console or X, respectively; XEmacs does not need this step:

(standard-display-european t)

(set-input-mode (car (current-input-mode))
   (nth 1 (current-input-mode))
   0)

This allows the user to view and input documents in Russian.

However, this isn't enough. Emacs doesn't yet know that Cyrillic characters may constitute a word, let alone the upper/lower case conversion rules. To teach Emacs these things, you have to modify its syntax and case tables:

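;; case-table provides set-case-syntax-pair.
(require 'case-table)

;; ruc holds the uppercase Cyrillic letters and rlc the matching lowercase
;; letters, in the same order, written as octal escapes.
(let* ((ruc "\341\342\367\347\344\345\263\366\372\351\352\353\354\355\356\357\360\362\363\364\365\346\350\343\376\373\375\370\371\377\374\340\361")
       (rlc "\301\302\327\307\304\305\243\326\332\311\312\313\314\315\316\317\320\322\323\324\325\306\310\303\336\333\335\330\331\337\334\300\321")
       (i 0)
       (len (length ruc)))
  (while (< i len)
    ;; Mark both letters as word constituents.
    (modify-syntax-entry (elt ruc i) "w  ")
    (modify-syntax-entry (elt rlc i) "w  ")
    ;; Register the upper/lower pair in the standard case table.
    (set-case-syntax-pair (elt ruc i) (elt rlc i) (standard-case-table))
    (setq i (+ i 1))))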
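
The octal escapes in the two strings above are easier to follow once decoded. The following Python sketch is only an illustration from outside Emacs: it assumes the bytes are KOI8-R (as the "KOI-8" menu label later in this section suggests), and the names RUC, RLC, UPPER, LOWER and case_pairs are made up for this example.

```python
# Illustration only: decode the byte strings from the Emacs Lisp snippet,
# assuming (not confirmed by the snippet itself) that they are KOI8-R.
RUC = (b"\341\342\367\347\344\345\263\366\372\351\352\353\354\355\356\357\360"
       b"\362\363\364\365\346\350\343\376\373\375\370\371\377\374\340\361")
RLC = (b"\301\302\327\307\304\305\243\326\332\311\312\313\314\315\316\317\320"
       b"\322\323\324\325\306\310\303\336\333\335\330\331\337\334\300\321")

UPPER = RUC.decode("koi8_r")  # uppercase letters as Unicode text
LOWER = RLC.decode("koi8_r")  # lowercase letters, same order


def case_pairs():
    """Return (uppercase, lowercase) pairs in the order the strings list them."""
    return list(zip(UPPER, LOWER))


if __name__ == "__main__":
    print(len(UPPER), "letters")
    for upper, lower in case_pairs():
        print(upper, lower)
```

Under that assumption, each string decodes to 33 Russian letters, and the letter at position i in the first string is the capital form of the letter at position i in the second, which is exactly the pairing the loop feeds to set-case-syntax-pair.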

To save you the typing, I have put this code, together with a couple of handy functions, into a file called rusup.el. To use it, load it from your ~/.emacs.

Finally, the russian.el package by Valery Alexeev (valery@math.uga.edu) allows the user to switch between Cyrillic and regular input modes and to translate the contents of a buffer from one Cyrillic coding standard to another (which is especially useful when reading texts imported from MS-DOS or Windows).
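
To illustrate what translating between coding standards involves, here is a minimal Python sketch. It is not how russian.el works internally; the function name recode is hypothetical, and the sketch simply relies on Python's built-in cp866 (MS-DOS), cp1251 (Windows) and koi8_r codecs.

```python
def recode(data, source, target="koi8_r"):
    """Re-encode bytes from one Cyrillic charset to another via Unicode.

    Raises UnicodeEncodeError if a character has no equivalent in the
    target charset.
    """
    return data.decode(source).encode(target)


if __name__ == "__main__":
    # The Russian word for "hello" as an MS-DOS (CP866) byte string.
    dos_text = b"\x8f\xe0\xa8\xa2\xa5\xe2"
    print(recode(dos_text, "cp866"))
```

Going through Unicode keeps the sketch short; a tool of the HOWTO's era would more likely have used a 256-entry byte translation table for each pair of encodings.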

4.2 Using vi

The vi editor (or at least its clone vim, available in most Linux distributions) is aware of 8-bit characters. It allows you to enter Cyrillic characters and recognizes word boundaries correctly. I don't know about the upper/lower case conversion rules, since I don't use vi much. If you know something about it, please inform me.

4.3 Editing text with joe

Joe requires a special -asis option to recognize 8-bit characters. You may either specify this option on the command line or put it in the ~/.joerc file for personal use (or in /usr/lib/joerc for a system-wide setup).

If your copy of joe doesn't understand the -asis option, you have to upgrade to a newer version.

However, joe doesn't seem to recognize Cyrillic word boundaries correctly. I assume that the same applies to the case conversion rules.

4.4 Spell-checking Russian

The program I use to spell-check text is GNU ispell. It is very flexible and extensible, so it can spell-check text in languages other than English once the appropriate dictionaries are added.

Constantine Knizhnik has created a very good Russian dictionary for ispell. You may find it at his homepage. The distribution includes a handy incremental spelling script for emacs.

Ideally, if you already have ispell properly installed, you just have to step into the newly created directory and generate the dictionary using the commands provided in the Makefile. However, chances are quite high that you'll see a lot of complaints about ispell being unaware of 8-bit data. This is because in most distributions ispell is compiled without 8-bit data support. In that case, you cannot avoid recompiling the ispell package.

Again, RedHat users will be delighted to know that I've rebuilt the ispell package with both Russian and German dictionaries. As usual, you may grab it from the RedHat FTP site.

Once you have everything installed, you may invoke the Russian spell-check by supplying the '-d russian' option to ispell.
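
The same option can also be used when ispell is driven from a script. The Python sketch below is only an illustration under several assumptions: ispell is on the PATH, a dictionary named russian is installed, and both the text and the dictionary use KOI8-R. The function names ispell_command and misspelled_words are hypothetical. It uses ispell's -l option, which reads text from standard input and lists the misspelled words.

```python
import subprocess


def ispell_command(dictionary="russian"):
    """Build the ispell command line: -l lists misspellings, -d picks the dictionary."""
    return ["ispell", "-l", "-d", dictionary]


def misspelled_words(text, dictionary="russian", encoding="koi8_r"):
    """Return the words ispell reports as misspelled.

    Assumes ispell is installed with the given dictionary and that the
    dictionary expects text in `encoding` (KOI8-R is an assumption here).
    """
    result = subprocess.run(
        ispell_command(dictionary),
        input=text.encode(encoding),
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode(encoding).split()
```

Calling misspelled_words on a string of Russian text would then return a list of the words ispell could not find in the dictionary.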

Now, if you use Emacs, you may want to add a menu item for the Russian dictionary. I sent a proposed menu entry to the ispell.el maintainer, and he kindly agreed to include it in the next public release of the file. Meanwhile, you may do it yourself by adding the following code to your ~/.emacs (or to /usr/share/emacs/site-lisp/site-start.el for a system-wide setup):

(setq ispell-dictionary-alist
  (append ispell-dictionary-alist
          '(("russian"
             "[\341\342\367\347\344\345\263\366\372\351\352\353\354\355\356\357\360\362\363\364\365\346\350\343\376\373\375\370\371\377\374\340\361\301\302\327\307\304\305\243\326\332\311\312\313\314\315\316\317\320\322\323\324\325\306\310\303\336\333\335\330\331\337\334\300\321]"
             "[^\341\342\367\347\344\345\263\366\372\351\352\353\354\355\356\357\360\362\363\364\365\346\350\343\376\373\375\370\371\377\374\340\361\301\302\327\307\304\305\243\326\332\311\312\313\314\315\316\317\320\322\323\324\325\306\310\303\336\333\335\330\331\337\334\300\321]"
             "[']" t ("-C" "-d" "russian") "~latin1"))))
  
(define-key-after ispell-menu-map [ispell-select-russian]
  '("Select Russian (KOI-8)" . (lambda ()
                                 (interactive)
                                 (ispell-change-dictionary "russian")))
  'british)

Unfortunately, this doesn't work in XEmacs. I'll try to solve this problem later.