
\MakeShortVerb\|
%
%
\providecommand{\ordinal}[1]{#1}
\providecommand{\PS}{\textsc{PostScript}}
\providecommand{\UKTUG}{\textsc{UKtug}}
\providecommand{\TeXXeT}{\TeX-{}-X\kern-.125em\lower.5ex\hbox{E}\kern-.1667emT}
\providecommand{\WWW}{\textsc{www}}
%
\title{Report of visit to CERN to attend presentation of $\Omega$}
\author[Robin Fairbairns]{Robin Fairbairns\\
  University of Cambridge Computer Laboratory}
\begin{Article}
\section{Introduction}
As you will have seen in earlier \BV{}s of this year, the Francophone
\TeX{} users' group, Gutenberg, arranged a meeting in March at CERN
(Geneva) to `launch' \Om.  \UKTUG{} responded to Gutenberg's plea for
support to enable \TeX{} users from impoverished countries to attend,
by making the first disbursement from \UKTUG's newly-established Cathy
Booth fund.  That money (together with some of the outstanding surplus
from \TeX{}eter '88) was contributed to the meeting's fund, which supported
the attendance of four representatives of CyrTUG (which covers Russian
and other users of the Cyrillic alphabet) and of one representative of
CSTUG (Czech Republic and---still---Slovakia).  Apart from these,
there was a large contingent from France, several from Switzerland
(including one German-speaking Swiss and one Englishwoman working in
Lausanne), and one each from Germany, the Netherlands, Spain,
Australia\footnote{Richard Walker, who is currently working in
  Germany} and the UK (me).

%Richard Walker                            richard@cs.anu.edu.au
%Department of Computer Science            Aust:  (06) 249 5689
%The Australian National University        Intl: +61 6 249 5689
%Canberra, ACT 0200, Australia             Fax:  +61 6 249 0010
%His present email addres is: walker@ipd.info.uni-karlsruhe.de

The speakers at the meeting were Michel Goossens (the president of
Gutenberg, acting both as host and as an expert on the background to,
and the use of, Unicode), and Yannis Haralambous and John Plaice, \Om's
two developers.

The meeting can be accounted a success; all who attended enjoyed
themselves, and also learnt a lot.  This article is the first of (at
least) two in which I will describe the thinking that led to the
production of \Om, the problems that it addresses and the ways it
solves those problems.

\section{What \emph{is} \Om?}

\Om{} is an extension of \TeX{} and related programs that has been
designed and written by Yannis Haralambous (Lille) and John Plaice
(Universit\'e Laval, Montr\'eal).  It follows on quite naturally from
Yannis' work on exotic languages, which have always seemed to me to be
bedevilled by problems of text encoding.

Simply, \Om{} (the program) is able to read scripts that are encoded
in Unicode (or in some other code that is readily transformable to
Unicode), and then to process them in the same way that \TeX{} does.
Parallel work has defined formats for fonts and other necessary files
to deal with the demands arising from Unicode input, and upgraded
versions of \MF{}, the virtual font utilities, and so on, have been
written.  \Om{} itself is based on the normal |Web2C| distribution
that underlies most modern Unix implementations of \TeX{}, and at
least one of the freely available PC versions.

\section{Why Unicode?}

Michel explained to us the sorry history of the development of
character sets for use in computing\footnote{This is an area where I
  have some expertise, too, so not all of this comes from Michel}.
There are somewhere between~3000 and~6000 languages in use in the
world for which a writing system exists.  (The set of languages is
shrinking all the time as the deadening effect of cultural intrusion,
primarily through the electronic media, overwhelms the desire to
support existing cultures to the extent of teaching their language to
the young.)  The distribution of languages is by no means even
throughout the globe (Michel showed us a map), and there are many that
have not been and will presumably now never be formally recorded.

When we come to writing systems, we find almost every variation
imaginable in use somewhere in the world.  The Latin-like system
(written left to right with modest numbers of diacritics simply
arranged) has very wide penetration, not least because so many
languages were first written down by Western European missionaries or
other explorers.  Languages such as Vietnamese are classified as
`complex Latin-like', with $\geq2$ diacritics per character; an
artificial example of the same effect is IPA (the International
Phonetic Alphabet) which has sub- and super-scripts and joining marks.
Languages such as Hebrew and Arabic are written right to left, and
constitute another class.  Then there are the multiple-ligature
writing systems typified by the Indic scripts such as Devanagari (of
which we had a fascinating exposition at the 1993 \UKTUG{} Easter
meeting on `non-American' languages, from Dominik Wujastyk), and
finally the syllabic scripts (such as Korean Hangul and Japanese
Hiragana and Katakana), and the ideographic scripts (Chinese and
Japanese Kanji).

Encodings are needed for computer operations on language of any sort.
There are differences between the coded representation and the written
(or printed) representation.  Everyone who's read about \TeX{} at all
will know about ligatures (the CM fonts, and most \PS{} fonts,
implement ligatures so that, for example, `|fl|' typed appears as `fl'
printed).  More significantly, almost all adults in Western cultures
write `joined-up', which is in itself an application of a form of
ligature.  All these ligatures are for presentation, not for
information, and so it is unreasonable for them to be represented in a
character set.  Other ligatures, however, form real characters in some
languages (examples are \ae{} in Danish and Norwegian, and \oe{} in
French).

In the dark ages (in fact, as recently as the early 1960s, when I
started computing), every make of computer system had its own
character code, many of them based on the 5-bit teleprinter codes used
in telex printers.  Eventually, the rather more sophisticated
teletypes appeared, which used seven bits of an eight-bit code; this
7-bit codification was standardised as ASCII (the American Standard
Code for Information Interchange), which was (in the area of
application it was designed for) an excellent code.  It had all the
properties needed for many of the significant developments of computers
in the 1960s, but it had one serious flaw: it was not able to encode
diacritics, which are used in almost every language (but which your
all-American information interchanger would seldom have a need for).

To regularise the resulting mess, ISO adopted the ASCII standard as
the basis for an international 7-bit character set, ISO~646.  ISO~646
is identical to ASCII in the code points that it specifies; however,
some of the characters that ASCII does specify are left ``for national
variation'' in ISO 646; ASCII itself then became the USA national
variation of ISO 646.  An example of national variation is defined for
the UK, which specifies that the code point that holds `|#|' in ASCII
should hold a pound sign (\pounds).  There are versions for various
Nordic languages that include characters such as \ae{} or \aa{} in place
of braces, a version for French with acute, grave and
circumflex-accented letters, and one for German that offers umlauts and
`sharp s' (\ss).

There were various attempts at mechanisms to assign different
character sets for use by those who need to use characters from
several different sets (for example, someone writing a
Swedish--English dictionary); an example is ISO~2022, which defines
escape sequences to effect such switches.  These efforts proved impractical (at
least they seemed so to me), and 8-bit developments of ISO 646 arose,
with the ability (comfortably) to express more than one language.

Thus were born the ISO~8859 character sets.  The commonest of these
(at least in the ken of most English speakers) is ISO~Latin-1
(ISO~8859-1, that is, part one of the multi-part standard), which was
designed for use by Western Europeans.  As well as the `basic ASCII
set' in the first 128 characters, it has diphthongs and vowels
appropriate to most Western European languages.  Oddly, it omits the
\oe{} diphthong that French uses, and (perhaps less
surprisingly\footnote{Given that Wales would have been represented by
  the BSI in the standardisation process}) it omits some of the accent
forms used by Welsh.  ISO~8859 didn't stop with part~1, though; there
are variants that accommodate Cyrillic (for Russian, Serbian, and
several other languages of the old Soviet Union), Arabic, Hebrew, and
so on.

This is all well and good, but it doesn't answer the needs of a writer
preparing multilingual documents, except in the case that the multiple
languages are accommodated in the same part of ISO~8859: it will happen
some of the time, but most `interesting' combinations will require
switches of character set whenever the language changes.

So ISO (by this time, jointly with IEC) started development of an
all-encompassing character set, to be numbered ISO/IEC~10646 (the
difference of 10~000 is no accident).  ISO/IEC~10646 was to accommodate
every possible language in the world by the simple expedient of
allowing 32-bit characters.  Of course, no-one can comprehend a 32-bit
character set, and so the set was to be structured as a hypercube of
different repertoires; the $(0,0,0,0)$ repertoire would be the same as
ISO~Latin-1, but all the other sets could be accommodated, too.

Independently, Apple and Microsoft got together to found the Unicode
consortium, whose aim was to define 16-bit characters that would cover
all the economically important world.  This criterion of economic
importance could easily have brought down the whole edifice: the
(increasingly important) languages of the Far East are at best
syllabic (e.g., Korean; Korea claims 11~000 of the code points in
Unicode), or even one character per word (e.g., Chinese; a full
classical Chinese repertoire would require well in excess of 65~536
characters, thus sinking a 16-bit code single-handedly).

Unicode's sponsors therefore enforced a process called `Han
unification', which aims to put the `same' character in any of
Chinese, Japanese and Korean in the same slot in the table.  This
unification is a distinctly dubious exercise: the same character may
have different significance in the different languages, but they are
all represented by the same code point.  Contrariwise, the Latin `H',
the Russian `H' (which sounds like the Latin `N') and the Greek `H' (capital
`$\eta$') all get different code points despite having the same paper
representation.  For this reason (among others), there remain doubts
as to whether the Japanese, in particular, will adopt
Unicode as a long-term replacement for their own national standards.

In the shorter term, however, there remained the possibility that
there would be two conflicting standards for the future of character
codes~--- a \emph{de facto} one (Unicode) and ISO/IEC~10646.  The
ISO/IEC standard reached its (nominal) final ballot without addressing
the relation to Unicode~\dots{}\ but (fortunately) it failed at that
hurdle, and for that reason.  Standards people are notorious for
ignoring the real world\footnote{The author has spent an
  unconscionably long period of his life on these things, and is
  therefore in a position to know}, but this time, they conceded
defeat.  ISO/IEC~10646 was edited to have the whole of Unicode as its
$(0,0,*,*)$ plane, and it has thus passed into the canon of published
standards.

So we may now discuss Unicode without running up against the ISO/IEC
standard: a splendid example of the behaviour known as ``common sense
prevailing''.

\section{Virtual Metafont and Fonts to Support Unicode}

It is known that \TeX{} is a general-purpose programming language.  In
`plain' text, we would type |"hello world"|.  For \TeX{} output we
would type |``hello world''|, which would be transparently
converted to ``hello world''.  Thus, the two grave accents and the two
single quotes constitute `programming'.  In the last analysis, you can
``do everything with \TeX{}''.

When English is typeset, the convention is that the space after the
full stop at the end of a sentence is expanded; \TeX{} makes
provision for this to happen by way of the |\sfcode| mechanism.  When
French is typeset, the convention is that the space is not expanded;
the |\sfcode| mechanism can provide this style of typesetting, as
well (cf.~the |\frenchspacing| macro of plain \TeX).
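
By way of a minimal illustration (these are standard commands of plain
\TeX{}, equally available in \LaTeX{}; the example is mine, not one
shown at the meeting):
\begin{verbatim}
% English convention: extra stretch after sentence-ending punctuation
\nonfrenchspacing
A first sentence.  A second sentence.

% French convention: all inter-word spaces are treated alike
\frenchspacing
Une premi\`ere phrase.  Une seconde phrase.
\end{verbatim}
|\frenchspacing| does no more than set the |\sfcode| of the
sentence-ending punctuation characters to~1000, so that no extra
stretch is attached to the spaces that follow them.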

Other features of French typesetting are more difficult to provide in
\TeX{}.  For example, an exclamation mark is separated from the
sentence: ``en fran\c cais\thinspace!''; to program this, the
exclamation mark needs to become an `active character', which is
always a tricky thing to do.
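
A much-simplified sketch of the sort of thing involved (this is an
illustration of the idea only, not the code of any actual package):
\begin{verbatim}
% make `!' active, and have it insert an unbreakable thin space
% between the preceding word and itself
\catcode`\!=\active
\def!{\unskip\thinspace\string!}
\end{verbatim}
The |\unskip| removes any ordinary space the author typed before the
exclamation mark, |\thinspace| supplies the unbreakable thin space
that French typography requires, and |\string!| typesets the mark
itself.  A real package must also take care that making |!| active
does not break its use in maths mode, in verbatim text, and so on;
hence the `trickiness'.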

Setting the French quotation marks (known as guillemets) becomes even
more tricky; the guillemets look like little |<<| and |>>|, and the
natural way to program them is by using repeated |<| or |>|
characters; Bernard Gaulle's |french.sty| does this (also setting a
space between the text quoted and the guillemets), but it's becoming
more and more complicated; even more so when we consider the French
rules for quotes within quotes.
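
As a much-simplified sketch of that active-character approach (again
an illustration only, with macro names invented for the purpose, and
not Bernard Gaulle's actual code; it also assumes a font that really
contains the guillemet glyphs, of which more below):
\begin{verbatim}
\catcode`\<=\active
\def<{\futurelet\next\CheckGuillemet}
\def\CheckGuillemet{%
  \ifx\next<\expandafter\OpenGuillemet\else\expandafter\LessThan\fi}
% #1 gobbles the second `<'; the \, is the small space inside the quotes
\def\OpenGuillemet#1{\guillemotleft\,\ignorespaces}
\def\LessThan{\char`\<}
\end{verbatim}
The closing guillemet (and the space that precedes it) would be
handled in just the same way by an active |>|; and, as with the
exclamation mark, the real code must avoid disturbing the use of |<|
in maths mode, which is where the complication sets in.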

More problems arise when we consider the question of diacritics.
English rather infrequently has diacritics, so it's not surprising
that \TeX{}'s method of dealing with them isn't perfect.  To typeset
an accented character, e.g.~\"a, one must type |\"a|, which is typeset
as two little boxes stacked on top of one another, rather like
\shortstack{\fboxsep0.5pt\fbox{..}\\\fboxsep0.5pt\fbox{a}}.  This does
work, but these composite glyphs no longer qualify (to \TeX{}) as
something that it's willing to hyphenate---\TeX{} only hyphenates
`words' made up of sequences of letters.  A language such as German,
with hyphenation suppressed for many words, is hardly a language at
all.  These observations are what led to the definition of the Cork
font encoding, in which a goodly proportion of Western European
letters with diacritics appear as single characters; if they are thus
represented, words containing them may be hyphenated.
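
A minimal way to see the difference for oneself (the example is mine;
it assumes a \LaTeX{} installation on which the Cork-encoded fonts are
installed):
\begin{verbatim}
\documentclass{article}
% with the default (OT1) fonts, \'e is built by the \accent primitive
% and TeX declines to hyphenate the word containing it; with the
% Cork-encoded (T1) fonts it is a single glyph, and hyphenation works
\usepackage[T1]{fontenc}
\begin{document}
% \showhyphens writes to the log the hyphenation points TeX would use
\showhyphens{h\'et\'erog\'en\'eit\'e}
\end{document}
\end{verbatim}
(For the breaks to fall in the right places for French, the French
hyphenation patterns must of course be loaded as well; the point here
is simply that, with the Cork encoding, breaks become possible at
all.)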

%\TeX{} is designed (in the last analysis) to typeset English, so that
%typesetting French, German, or other `foreign' languages is a second
%priority; this is the problem \Om{} is attempting to address.

With the Cork encoding, which is in effect an output encoding, we
encounter a further problem relating to the nature of communication.
The problem arises from the nature of character sets; while there are
many well-established character sets, there are seriously different
camps into which they fall.  For example, the character
`{\fontencoding{T1}\selectfont\TH}' (Thorn) appears in Microsoft Windows'
character set but not in the Macintosh set, while `$\Omega$' appears
in the Macintosh set but not in the Windows set; both of these sets
are based on ASCII.  To solve this problem, of encoding everything
that appears in any character set, there has to be a super-encoding.
This can be either a multi-character representation, as in the \WWW{}
encoding \textsc{html} (for example, the encoding for \'e would be
|&eacute;|), or a super-character set, as in Unicode.

In the present arrangement of typesetting technology, we have the
situation where non-English users sit at a computer, and express their
own language via a local layer in ASCII or a derivative of it~--- i.e.,
we have a picture like:

\begin{center}
  \input{noinfo.pic}
\end{center}

In this arrangement, the human interface allows the use of local
characters, and the display will show what's typed.  The typography
does the display job again (possibly differently); however,
communication of the text to be typeset is difficult, because of the
local nature of the interface.

The information to be transmitted needs to be encoded.  There is no
limit to the number of local encodings that may exist; equally, there
is no constraint on the representations used by the typographic
system.  However, to facilitate the transmission of information, a
common schema of its representation in the coded data must exist.

\begin{center}
  \input{info-int.pic}
\end{center}

The ultimate mechanism for ensuring that such a schema exists is to
require that everything be transmitted in a common encoding scheme;
\Om{} employs ISO~10646/Unicode for this.  Input text is transformed
into \Om{}'s internal `information' by an Omega Translation
Process~(OTP); OTPs may also be used to transform the information
during its processing within \Om{}, and an OTP is also used to derive
the coding of the font, to be used for typesetting, from the
Unicode-encoded information within \Om{}:

\begin{center}
  \input{info-ome.pic}
\end{center}
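
Purely to give a flavour (the details belong in the next article, and
this is sketched from memory of the presentation, so the exact syntax
should be checked against the \Om{} documentation), an OTP source file
that turned the ASCII digraphs |<<| and |>>| into the single Unicode
guillemet characters might look something like this:
\begin{verbatim}
input:  1;
output: 2;

expressions:

`<'`<'  => @"00AB ;
`>'`>'  => @"00BB ;
.       => \1 ;
\end{verbatim}
The first two lines say that the process reads one-byte (here,
Latin-1) characters and writes two-byte (Unicode) ones; the
expressions then map |<<| and |>>| to the Unicode guillemets, and the
final rule passes any other character (|.|, with |\1| standing for
whatever was matched) through unchanged.  The source is compiled into
a binary |.ocp| file, which an \Om{} run loads and pushes onto its
list of active translation processes for the text that follows.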

At this point, we're beginning to trespass on the subject matter of
the next article: the internal workings of \Om.  That article is to
appear in the next edition of \BV{}.
\end{Article}
\endinput
