\newcommand{\mod}{\mathop{\rm mod}\nolimits}
\newcommand\dtd{\acro{DTD}}
\newcommand\SGML{\acro{SGML}\xspace}
\newcommand\ISO{\acro{ISO}\xspace}

\title{Standard DTDs and scientific publishing}
\author{N. A. F. M. Poppelier (\texttt{n.poppelier@elsevier.nl}),\\
E. van Herwijnen (\texttt{eric@vanherwijnen.org}), and \\
C.A. Rowley (\texttt{C.A.Rowley@open.ac.uk})}

\date{7 August 1992}

\let\Tub\TUB

\begin{Article} 

\section{Abstract} 
 
 This paper has two parts.  
In the first part we argue that scientific publishing  
needs \textsl{one} standard \dtd{} for each class of documents  
that is published: for example, one for all research  
papers and one for all books. In the second part  
we apply this reasoning to mathematical formulas, and  
we outline some design requirements for a document  
type definition for mathematical formulas. In the  
appendices we discuss and compare existing document  
type definitions for mathematical formulas. 
 
\section{Introduction} 
 
In the preface to \cite{one} Charles Goldfarb wrote that the  
Standard Generalized Markup Language can be described 
as many things, and that \SGML is all that -- and more. In  
the introduction to \cite{one} Yuri Rubinsky wrote: 
\begin{quote}
\ISO~8879 never describes \SGML as a meta-language, but  
everything about its system of declarations and notations  
implies that a developer has the tools to build exactly what  
is required to indicate the internal structure of any type of  
information in a common tool independent manner. 
\end{quote}
Indeed, a strong point of \SGML is that it can be regarded as  
a meta-language, a tool with which one can define the syntax  
of many languages, very much similar to context-free grammars.  
In \SGML terminology these `languages' are called \textsl{document type  
definitions}, or \textsl{\dtd{}}s for short. \dtd{}s can be written
for any type of information: research papers, books, music. A
\dtd{} can be used for many purposes, of which two important ones  
are storage and exchange of information coded according to this  
\dtd{}. 
 
The premise of this paper is that the exchange of information,  
if it is based on \SGML, needs a single common \dtd{}, agreed upon  
by all parties involved, for each class of documents that is  
exchanged.
 
Suppose two parties, $A$ and~$B$, exchange information in the  
form of one class of documents, and that they each have a \dtd{},  
$D(A)$ and $D(B)$, with $D(A)$ not identical to $D(B)$. If~$A$ sends a  
document to~$B$ then~$A$ can include the document type
definition $D(A)$ for that document (instance) at the beginning of the  
document. This enables~$B$ to use an \SGML parser to check the  
validity of the document he received. However, there is nothing more~$B$
can do with the document: the \dtd{} $D(A)$ contains no information about the
meaning  of the coding scheme that $D(A)$ defines, and a mapping of the document
from $D(A)$ to $D(B)$ is a  procedure that cannot be automated. The problem
becomes even more difficult when a third party, $C$, is  introduced, who
accepts material from both~$A$ and~$B$. How is~$C$ going to
handle material with two  different coding schemes? 
 
This is where we encounter one of the weaknesses of \SGML \textsl{as it is being
used  currently}, namely that it  enables every party involved in this process to
define and use a different \dtd{}. 
 
\section{Scientific publishing}\label{sci-pub}
 
In the rest of this paper we concentrate on the exchange of information  that
occurs in scientific  publishing, in particular on the exchange of papers that
contain mathematical formulas and are  published in research journals. Recent
developments in this area formed the main reason for writing this  paper. A few
standards for encoding of mathematical formulas have already emerged, of which a
well-known one is the \acro{AAP} Standard or Electronic Manuscript Standard
\cite{two}. A \dtd{} for mathematical  formulas accompanies this
standard, but it is not part of it. Another standard for mathematical
formulas  is the one adopted by \acro{CALS} \cite{three}, and others are
under development \cite{four},
\cite{five}.
 
The handling of mathematical formulas in scientific publishing is part of the 
bigger whole of  information exchange within a (the) scientific community, with
the publisher as intermediary, as is  shown below:

\begin{picture}(100,80)(-70,0)

\put(40,50){\oval(80,40)}
\put(30,60){$C$}

\put(59,50){\oval(20,10)}
\put(55,46){$P$}
\put(65,50){\vector(-1,-2){20}}

\put(40,10){\oval(20,10)}
\put(36,6){$G$}

\put(34,10){\vector(-1,2){20}}


\end{picture}

\noindent The authors of research
papers are the providers, $P$. The publishers are the gatherers  of information,
$G$. They accept information from many providers, gather this in the form of a
journal  issue, and distribute this. In this process, the publisher provides a
quality check via the system of peer  reviewing, makes notation consistent, and
in some cases improves the prose. The information is  distributed to a group of
consumers, $C$, with the set~$C$ a superset of the set~$P$. In this process, two
sorts  of information can be exchanged: 
\begin{itemize}
\item material that is structured in the sense
of being encoded according to,  and checked against, some formal structural
specification such as a \dtd{}; 
\item material that is not structured.
\end{itemize} 
At present most of the material exchanged in the process of scientific 
publishing is of the unstructured  type. We expect that this will remain the
situation in the near future. As soon as authors get the  possibility of using
more sophisticated tools, we expect that publishers will receive increasing
numbers  of papers of the structured type. 

Several scientific publishers, among whom Elsevier Science Publishers, have 
adopted \SGML as the  future main tool for the process of publishing scientific
articles \cite{six}, and several other publishers have  made, or are
expected to make, the same choice. The European Laboratory for
Particle Physics  (\acro{CERN}), a large community of information providers,
are using \SGML to automate the loading of  bibliographic information
in their library's database \cite{seven}. For both authors and
publishers it would be  advantageous to agree on one \dtd{} for the
encoding of research papers. There are several reasons for  this: 
 \begin{itemize}
\item Most authors do not submit all their articles to one and the same
publisher every time. At present  they are confronted with `Instructions to
Authors' that differ significantly from publisher to publisher.       
\item A recent trend is that authors prepare their papers with text-processing
software on some computer.  This enables them to send the paper in electronic
form (electronic manuscript or `compuscript') to the  publisher. Publishers are
confronted with a variety of text-processing software on a variety of computer 
systems \cite{eight}, \cite{nine}. Moreover, every field of science
appears to have its own `Top Ten' of most used text  processing
packages. 
\item Bibliographic information about all research papers in all (or most)
scientific journals is stored in  bibliographic databases. 
\end{itemize}
In an ideal world, authors would still be able to use their favourite text-processing system, which would 
generate \SGML `behind the scenes', so to speak. All publishers would
accept one standard \dtd{}, all  text-processing systems would be
able to generate documents prepared according to this \dtd{}, and all 
bibliographic databases would be able to store this material. 

An example of activities towards achieving this ideal situation: the  European
Working Group on \SGML  (\acro{EWS}) and the European Physical Society (\acro{EPS}) have taken
the Electronic Manuscript Standard and  are trying to develop it into a complete
\dtd{}, which should be acceptable to information providers,  information gatherers
and information consumers. The Electronic Manuscript Standard is now a Draft 
International Standard, \ISO/\acro{DIS} 12083. The \acro{EWS} and \acro{EPS} hope that the final
standard will include  their work. 
 
\section{Encoding of mathematical formulas} 
 
In Annex A of \ISO~8879~\cite{ten}  we find the following: 
 \begin{quotation}
Generalized markup is based on two novel postulates: 
\begin{itemize}
\item    Markup should describe a document's structure and other
attributes rather than specify processing to  be performed on it, as descriptive
markup need be done only once and will suffice for all future  processing. 
\item     Markup should be rigorous so that the techniques available for
processing rigorously defined  objects like programs and databases can be used
for processing documents as well.  
\end{itemize}
\end{quotation}

There is no reason why this should not be
valid for mathematical formulas. We need to delimit the kind  of mathematical
formulas we are trying to describe if we want an unambiguous structure. The
field of  mathematics is so vast, that it may be impossible to design a single
\dtd{} that covers every kind of  mathematical formula. If we concentrate on those
sciences which use mathematics as a tool, for  example physics, we see that the
mathematics used in many physics papers can be described as ``advanced 
calculus''. This definition can be made more precise by referring to some standard
textbooks containing  these types of formulas, e.g.\ \textsl{Handbook of
Mathematical Functions} \cite{eleven} and the \textsl{Table of
integrals,  series and products}  \cite{twelve}. 
 
If we aim for rigorous encoding of mathematical formulas (the second postulate), we must develop a 
system of descriptive markup of mathematical formulas that enables us to: 
\begin{itemize}
\item convert the formulas between different word processors;       
\item store the formulas in and extract them from a database;       
\item allow programs to input or output formulas in descriptive markup.
\end{itemize} 
An example of the first application would be the conversion of mathematical 
formulas coded in \LaTeX\  to, say, Word\footnote{Word is a registered
trademark of Microsoft.} via \SGML. The benefits of using \SGML as an intermediate
language for conversion are  described  in  \cite{thirteen}. Note,
for example, that the number of programs required for pairwise 
conversion between~$n$ languages is proportional to $n^2-n$ without
an intermediate language, but to
$2n$  with an intermediate language. 
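
As a worked instance of these counts, assuming one program per ordered pair of
distinct languages in the first case, and one program into and one out of the
intermediate language per language in the second:
\[
n(n-1) = n^2 - n \quad\mbox{versus}\quad 2n ,
\qquad\mbox{so for } n = 5 \mbox{ we get } 20 \mbox{ versus } 10 .
\]
Under these assumptions the two counts coincide at $n=3$, and the intermediate
language needs fewer programs for every $n>3$.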
 
An example of the second application would be encoding and storing the complete 
contents of the  above mentioned \textsl{Handbook of Mathematical Functions}
\cite{eleven} and \textsl{Table of integrals, series and products} 
\cite{twelve} in a database, so that this information can be accessed
on-line by, say, mathematicians and  physicists. Many articles have
mathematical formulas in their titles, so any program that extracts 
bibliographic data should be able to handle mathematics as well. 
 
An example of the third application would be the extraction and subsequent use 
in a computer program,  written in an ordinary programming language or, for
example, in Mathematica.\footnote{Mathematica is a registered trademark of
Wolfram Research.} 
 
At this point we come back to the ideal world for scientific publishing we 
sketched earlier. In this  world, publishers would use one standard \dtd{} for
scientific papers, which enables them to prepare a  primary publication -- in
paper and (or) in some electronic form -- and to store the information in 
databases for various secondary purposes. 
 
The question now is: what should a \dtd{} for mathematical formulas look like,  if
it is going to be used for  these purposes? 
 
There are two choices for a \dtd{} for mathematics: 
\begin{itemize}
\item P-type: the \dtd{} reflects the Presentation or visual structure; examples
of this type are discussed in  the appendices. 
\item S-type: the \dtd{} reflects the Semantics or logical structure; at present no
\dtd{}s of this type exist. 
\end{itemize}
The  quotation from Annex~A of \ISO~8879 \cite{ten} indicates
the preference of the creator(s) of \SGML: markup  of a formula should be of
S-type, it should describe the logical structure of the formula, rather than
the  way it is represented on a certain medium, say the page of a traditional
(non-electronic) book. 
 
Let us suppose, for the sake of the argument, that an information gatherer,  a
publisher, chooses a \dtd{} of  S-type. This raises two further questions:
\begin{enumerate} 
\item Is descriptive markup of mathematical material possible? 
\item If it is possible, who can use it and  for which purposes? 
\end{enumerate}
The second question needs some explanation. As discussed in section
\ref{sci-pub},  in the process of scientific  publishing two sorts of 
information can be exchanged: mathematical material that is structured 
according  to a formal structural specification, and material that is not
structured. This means that there are two  possible scenarios. 
 
Scenario 1: an author submits a paper in the form of a manuscript
(paper), i.e.\ with unstructured  formulas, or a compuscript with
mathematical formulas in P-type notation (\TeX, WordPerfect, \dots). 
 
Scenario 2: an author submits a paper with mathematical formulas in  S-type
notation. In scenario 1 it is  the task of the publisher to convert from paper
or P-type notation to S-type notation. Before we discuss  the feasibility of
this conversion, we will first look at some characteristics of mathematical
notation. 
 
\subsection{Characteristics of mathematical notation}\label{character} 
 
Mathematical notation is designed to create the correct ideas in the  mind of
the reader. It is \textsl{deliberately}  ambiguous and incomplete: indeed, it is
almost meaningless to all other readers. Or, more technically:  the intrinsic
information content of any mathematical formula is very low. A formula gets its
meaning,  i.e.\ its information content, only when used to communicate between
two minds which share a large  collection of concepts and assumptions, together
with an agreed language for communicating the  associated ideas. 
 
The ambiguity encountered in mathematical notation can be of two types
\cite{fourteen}:
\begin{enumerate}
\item   A generic notation uses the same symbols to
represent similar but different functions, for example  `$+$' or `$\times$'. In
the case of addition this is not really a problem, but multiplication is a
problem, since multiplication of numbers is commutative, whereas matrix
multiplication is non-commutative! 
\item A  more fundamental ambiguity is posed
by the same notation being used in different fields in different  ways. For
example: $f'$ stands for the first derivative of~$f$ in calculus, but can mean
`any other entity  different from $f$' in other areas. 
\end{enumerate}
 
More examples of ambiguity are:
\begin{itemize}
\item       Does~$\bar x$ represent a mean, a conjugation or a negation?  
\item     Is~$i$ an integer variable, e.g.\ the index of  a matrix, or is it
$\sqrt{-1}$?     
\item The other way around: is $\sqrt{-1}$ denoted by~$i$ or
by~$j$?\footnote{There are examples of authors actually writing something like 
$[L_i,L_j] =\frac{i}{2}L_k$, where the first~$i$ is an 
index, and the second~$i$ stands for~$\sqrt{-1}$.} 
 
\item    What is the function of the~2 in $\textrm{SU}_2$, $\log_2x$, $x_2$, $x^2$,
$T_2^2$?\footnote{In $\textrm{SU}_2$ it is the number of dimensions of the Lie
group; in $\log_2x$ it is the base of the logarithm; if~$x$  is a vector, the~${}_2$
in~$x_2$ is an index; the~${}^2$ in~$x^2$ could be a power, but if~$T$ is
a tensor, the~${}^2$ in~$T^2_2$ is a  contravariant tensor index.} 
\item Is $|X|$   the absolute value of a real (complex) number~$X$
or the polyhedron of a simplicial  complex~$X$ \cite{fifteen}?  
\end{itemize}
The inverse problem, which is equally common, arises when different typographical
constructs have the  same mathematical meaning. For example, the meanings of
both the following two lines would be  coded identically 
\begin{eqnarray*} 
3 &+& 4 (\mod 5)\\
3 &+_5& 4 
\end{eqnarray*} 
and this would lead to great difficulty if an author wanted to write: 
\begin{quote}
We shall often write, for example, $3 + 4 (\mod 5)$ in the shorter form $3 +_5
4$, or even as simply $3+4$  when this will not lead to confusion. 
\end{quote}
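
In a P-type notation the three forms mentioned in this quotation stay distinct
in the source. A minimal sketch in \LaTeX, assuming the standard \verb|\pmod|
command for the parenthesized modulus:
\begin{verbatim}
$3 + 4 \pmod{5}$  % explicit modulus in parentheses
$3 +_5 4$         % modulus as a subscript on the operator
$3 + 4$           % modulus left to the context
\end{verbatim}
A coding that records only the meaning `addition modulo~5' could not tell
these three apart.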
 
 
Of course, natural languages are similarly ambiguous and incomplete,  but no one
we know is suggesting  that in an \SGML document each word should be coded such
that it reflects the full dictionary definition  of the meaning which that
particular use of the word is intended to have! 
 
\subsection{Who performs the markup of math?} 
How does one convert P-type mathematical material, which an author has 
produced, to S-type notation,  which the publisher uses? 
In \cite[p.~9]{one} Goldfarb gives a three-step model for document
processing: 
\begin{enumerate}
\item recognition of part of a document (adding a generic identifier
for the appropriate element);\label{first}
\item      mapping (associating a processing function
with each element);\label{second} 
\item     processing (e.g.\ translating  elements into word
processor commands).\label{third} 
\end{enumerate}
 In the publishing of scientific papers and books steps~\ref{second}
and~\ref{third} are the responsibility of the publisher. 
Traditionally, step~\ref{first} was also their responsibility: the
technical editor adds markup signs in the margin  of the manuscript,
depending on the text and the visual representation that the house
style dictates. It is,  however, unlikely that a technical editor is
capable of identifying the precise function of every part of a  
mathematical  formula,  for  several  reasons,  most  of which  were 
discussed  in  the  previous   subsection, namely that mathematical
notation:    
\begin{itemize}
\item      is not unambiguous,    
\item     is not completely  standardized,   
\item      is not a closed system. 
\end{itemize}
 Even if the technical editor were capable of identifying every
part of a formula, this would be too time-consuming -- and therefore too costly.
However, under certain conditions \cite{sixteen}, automatic
translation  from visual structure to logical structure of
mathematical material is simplified greatly. 
 
This, and what we discussed in section~\ref{character}, leads us  to
conclude the following. A publisher has no  choice but to use a
P-type \dtd{} for mathematical material that is submitted in
unstructured form or in P-type notation. Even if S-type markup of a
mathematical formula were possible, conversion from P-type to
S-type would be difficult or even impossible. Conclusion: the tags
for S-type markup should not  be added by the information gatherer,
but by the information providers, i.e.\ the authors, who should be 
able to identify each part of their formulas. 
 
\subsection{Feasibility of S-type notation}
 In our second scenario, authors
would submit papers with  mathematical formulas in S-type notation. This would
enable the publisher to `down translate'\footnote{`Down' because information is
lost in the process; we borrowed the terminology of translating `up'  and
`down'  from Exoterica OmniMark.} to any  mathematics typesetting language
(P-type notation). However, the same reasoning as in section~\ref{character}  leads us to
the following conjecture: 
 
Conjecture. It is impossible to create an S-type \dtd{} for all of mathematics.
 
Representing the ``full meaning'' of a mathematical formula, if such a notion
exists, will almost certainly  lead to attempts to pack more and more
unnecessary information into the representation until it  becomes useless for
any purpose. This is rather like Russell and Whitehead reducing ``simple 
arithmetic'' to logic and taking several pages of symbols to represent the
``true meaning of $2+2=4$''. 
     
Even if it were possible to define an S-type \dtd{} for a certain
branch of mathematics, problems would remain. Suppose an
S-type \dtd{} contains an element for a ``derivative'' of a function.
Since the S-type \dtd{} will not contain any presentational attributes,
a decision will have to be made to represent the derivative of
$f(x)$ on paper as $f'(x)$ or $\frac{\text{\fontfamily{cmr}\selectfont
    d}f(x)}{\text{\fontfamily{cmr}\selectfont d}x}$.
There are, however, times (such as in this article) that both
representations are required for the same semantic object, and that
the author will need other notation in addition to that defined by
the S-type \dtd{}.
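
One way to picture this in \LaTeX{} is a pair of macros for the same semantic
object. The names \verb|\derivprime| and \verb|\derivfrac| are hypothetical,
chosen only for this sketch, and are not part of any \dtd{} discussed here:
\begin{verbatim}
% Two presentations of one semantic object: the derivative of
% the function #1 with respect to #2 (names are assumptions).
\newcommand{\derivprime}[2]{#1'(#2)}
\newcommand{\derivfrac}[2]{\frac{\mathrm{d}#1(#2)}{\mathrm{d}#2}}
% $\derivprime{f}{x}$ typesets the primed form,
% $\derivfrac{f}{x}$ typesets the fraction form.
\end{verbatim}
An S-type element would carry only the two arguments; the choice between the
two macros is exactly the presentational information it lacks.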

A likely reason for the belief that an S-type \dtd{} is possible, is
that many people in the worlds of document processing or computer
science are convinced that each symbol has at most a few possible
uses and that mathematical notation is as straightforward to analyse
as, for example, a piece of code for a somewhat complicated
programming language. The reality is that mathematical notation is
more akin to natural language: it is ambiguous and incomplete, as we
pointed out earlier.

\subsection{Some problems with existing languages}
To show that capturing mathematical syntax in a
\dtd{}, let alone its semantics, is far from straightforward, consider the example of a limit
\[
\lim_{x\to a}f(x)
\]
The syntactic structure of a limit is:
\begin{itemize}
\item The limit operator
\item The part containing the variable and its limit value
\item The expression of which the limit is to be taken
\end{itemize}
The first part could:
\begin{itemize}
\item always be ``lim'', in which case it is just a part of the
presentation of the formula and it should be left out.
\item be one of a finite list of alternatives, indicating the type
of limit ($\liminf$, $\sup$, $\max$, etc.). In this case it should be
an attribute.
\item be any expression.
\item be any text.
\end{itemize}
We think the second possibility comes closest to the syntax of the
limit construct. The second and third parts can be any mathematical
expression.
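
This three-part syntax, with the operator drawn from a finite list, can be
sketched as a \LaTeX{} macro. The name \verb|\limitof| and the order of its
arguments are illustrative assumptions, not taken from any of the \dtd{}s
discussed in this paper:
\begin{verbatim}
% #1 (optional): limit operator, e.g. \lim, \liminf, \limsup
% #2: bound variable, #3: its limit value, #4: the expression
\newcommand{\limitof}[4][\lim]{#1_{#2\to #3}#4}
% \[ \limitof{x}{a}{f(x)} \]  typesets the limit shown above
% \[ \limitof[\liminf]{n}{\infty}{a_n} \]  selects another operator
\end{verbatim}
The optional argument plays the role of the attribute in the second
possibility; the remaining arguments accept any mathematical expression.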

Now let's look at the way this formula is coded with the \dtd{}s from
\ISO \acro{TR}~9573, \acro{AAP} math and Euromath respectively. Using the
mathematics \dtd{} from \ISO \acro{TR}~9573 there are three possibilities:
\begin{itemize}
\item \verb|lim <sub pos=mid> x &rarr; a </sub> f(x)|
\item \verb|<plex><operator>lim</operator><from>x &darr;|
\verb|a</from> <of>f(x)</of></plex>|
\item \verb|<mfn name=lim><sub pos=mid>x &rarr;|\\
\verb|a</ll><opd>f(x)</opd></lim>|
\end{itemize}
whereas with the Euromath \dtd{} we would have:
\begin{verbatim}
<lim.cst><l.part.c limitop=lim><range>
<relation>x&rarr; a </relation></range>
</l.part.c><r.part.c><textual>f(x)</textual>
\end{verbatim}


We see that the \acro{AAP} and Euromath expressions are closest to the limit  syntax.
The best solution from  \ISO \acro{TR}~9573 involves a more general ``plex''
construct, which can be used for integrals, sums, products,  set
unions, limits and others. When the plex construct contains the
actual lower and upper bounds it  may even give semantic
information.  

Some mathematicians, however, are not satisfied with
this solution \cite{seventeen}. The plex operation is probably  a
notation  for an iterated application of a binary operation (e.g.\
sums and products),  while limits are of  a different nature. In many
cases only the from part  will be used, and there the whole range of
the bound  variable will be indicated, as an interval or a more
general set. How does one go about extracting the  bound variable? 
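
To make the distinction concrete: a sum or a product unfolds into repeated
application of a binary operation,
\[
\sum_{i=1}^{n} a_i = (\cdots((a_1 + a_2) + a_3) + \cdots) + a_n ,
\qquad
\prod_{i=1}^{n} a_i = (\cdots((a_1 a_2) a_3) \cdots) a_n ,
\]
whereas no comparable finite expansion exists for $\lim_{x\to a} f(x)$.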
 
This supports our conjecture from the previous section, namely that it is very
hard to capture the  semantics for all mathematics. It also suggests that some
redundancy is required to select whichever  notation is most appropriate in a
certain context. 
 
\section{Re-using mathematical formulas}
 There are two important uses for a
generically coded mathematical  formula. The first one is in a mathematical
manipulation -- or computer algebra -- system (\acro{MMS}), such  as Mathematica
\cite{eighteen} or Maple \cite{nineteen}. Computer programs for the
numerical evaluation of formulas, for  example written in
\textsc{Fortran} or Modula-2,  can also be regarded as mathematical
manipulation  programs.  

The second form of re-usage is in a mathematical typesetting system, for
formatting the formula on  paper or on screen; examples of this are \TeX\
\cite{twenty} and eqn/troff \cite{twentyone}, \cite{twentytwo}.  

For computer algebra systems the notation for the formula should be such that a
particular type of  manipulation on a particular system is possible, given a
`background' of concepts and assumptions that  enables the system to interpret
the input as a mathematical statement.  

The coding of a formula that is adequate for document formatting, for example the
\TeX\ notation \verb|f^{(2)}(x)|, is very unlikely to contain much of the
information required for a manipulation system to make  use of it. However, for
a limited field of discourse it is feasible to use the same coding for both types
of  system \cite{sixteen}. 
 
Some examples: the square of $\sin x$ is typographically represented as
$\sin^2x$, but a system like  Mathematica or Maple would probably prefer
something like $(\sin x)^2$ as input. Typesetting the inverse  of $\sin x$ as
$\sin^{-1}x$, however, could be confusing: does it mean $1/(\sin x)$ or $\arcsin
x$? 
 
An \acro{MMS} would probably require the second derivative of a function~$f$ with
respect to its argument~$x$ to  be coded as $(D,x)((D,x)f(x))$ but
on paper this would be represented as $f''(x)$, or $f^{(2)}(x)$, or
$\frac{\displaystyle\text{\fontfamily{cmr}\selectfont d}^2f(x)}%
    {\displaystyle\text{\fontfamily{cmr}\selectfont d}x^2}$.
 
On the output side of an \acro{MMS} there are other problems since some of the coding
necessary for  typographically acceptable output cannot be automatically derived
by the system from the coding used  by the \acro{MMS}.  

The Euromath view \cite{seventeen} is that a common interface should
be designed together with the manufacturer  of an \acro{MMS}. Perhaps an
\acro{MMS}-type \dtd{} will be required. 
 
\section{Related problems}
Another problem is, of course, that mathematics is by its nature extensible, so 
there will always be new types of manipulations to be done. Notations are
changed or new notations are  invented almost every day, figuratively speaking.
Normally these new subjects will use existing  typographic representations, but
the computer algebra system will not know what formatting to use!  Occasionally
a new typographic convention will be needed. And although there is agreement
on the  notation for most mathematical concepts, authors of books on mathematics
tend to introduce alternative  notations, for instance when they feel this is
necessary for didactic reasons. Mathematical notation is  not standardized, and
it is open -- anyone can use it, and add to it, in any way they wish. 

If we consider a given \dtd{} at any time, we have to ask ourselves: can an author
add elements when the  need for this arises? Theoretically the answer is `Yes,
he can' \cite[p.~71]{twentythree}, although it is not 
straightforward to include the new elements in the content models of
existing elements.  

Are such modifications by the author desirable? A \dtd{} which is locally modified by
an author will quickly  give rise to the situation described in the introduction
to this paper, and this should therefore probably  be  discouraged. Others,
however, have also noticed a need for private elements, as described in \acro{EPSIG} 
News 3, no.~4; one of the challenging aspects of using \SGML being encountered by
the Text Encoding Initiative is that the  guidelines
need to be extensible by researchers. They need to be able to extend
the \dtd{} in a disciplined way.
 
This problem, however, may not be a serious one. The collection of style 
elements is almost a closed  set, since the number of fonts, symbols and ways
to combine them is limited. In fact, most notation is  not syntactically new,
since the limited number of constructs works well as a notation. The multitude
of  notations is obtained by combinations of fonts, symbols and positions (left
or right subscript, left or  right superscript, atop, below, \dots), and by
giving one notation more than one meaning. This again seems to  support our view
that only a P-type \dtd{} can be constructed for \emph{all} of mathematics.

An \SGML  \dtd{}, of whatever type, also doesn't solve the problems of new atomic or
composite symbols, which  occur frequently in mathematics. As with new elements,
an author can add entities for these new  symbols. There is no method to add the
name of a new symbol, whether atomic or composite, to an  existing set of entity
definitions for symbols, other than to contact the owner of the set and wait for
an  update. 
 
Although there is now a standard method to describe that symbol's glyph 
(shape) \cite{twentyfive}, it is not  practical for an author to
include it. A compromise solution seems to be to extend an existing
set, such  as the one from \ISO \cite{twentysix}, as much as
possible, and try to standardize its use. 
 
\section{Conclusions} 
We have argued as follows:
\begin{itemize} 
\item  That a logical \dtd{} in the sense of describing the structure of
the mathematical meaning is as  impossible for maths as it is for natural
language, and also it is useless for formatting since the same  mathematical
structure can be visually represented in many different ways. The correct one
for any  given occurrence of that structure cannot be determined automatically,
but must be specified by the author. 
\item  That what needs to be encoded for formatting purposes, is information that
enables a particular set of  detailed rules for maths typesetting to be applied.
This could be described as a `generic-visual  encoding' or `encoding the logic
of the visual structure'. To establish exactly what these codes should  
be will require an expert analysis (probably involving expertise from 
mathematicians, particularly  editors, and from typographers aware of the
traditions of mathematical typesetting).      
\item That this is  different to what
needs to be encoded for use in mathematical manipulation software. Since neither
of  these encodings can be deduced automatically from the other, a useful
database will need to store both.  Perhaps a separate \dtd{} will be required to
enable this communication. 
\end{itemize}
Possible solutions are:
\begin{itemize}
\item a \dtd{} based on a hybrid of visual structure and  logical structure;
\item two \dtd{}s,  one for visual structure and one for logical structure, that
are linked in some fashion;
\item two concurrent  \dtd{}s, one for visual structure and one for logical
structure. 
\end{itemize}
 
The simplest solution is probably to have a basic visual structure which is
 described as an \SGML entity,  supplemented with a (redundant) logical
structure, described by a second \SGML entity. This solution  avoids any special
\SGML features and gives the user all flexibility for mixing and matching as
required.  We believe that similar reasoning can be applied to tables and
chemical formulas, where the problem of  separating form from content is just as
complex, or even more so. 

\begin{thebibliography}{10}

\bibitem{one}
Charles Goldfarb.
\newblock {\em The {\SGML} Handbook}.
\newblock Oxford University Press, Oxford, 1990.

\bibitem{two}
Standard for electronic manuscript preparation and markup version 2.0.
\newblock Technical Report Z39.59-1988, {\acro{ANSI}/\acro{NISO}}, 1987.

\bibitem{three}
Techniques for using {\SGML}.
\newblock Technical Report 9573, {\ISO}, 1988.

\bibitem{four}
American~Chemical Society.
\newblock {\acro{ACS}} journal \dtd{}.

\bibitem{five}
Bj{\"{o}}rn von Sydow.
\newblock On the \texttt{math} type in {E}uromath.

\bibitem{six}
N.~A. F.~M. Poppelier.
\newblock {\SGML} and {\TeX} in scientific publishing.
\newblock {\em \TUB}, 12:105--109, 1991.

\bibitem{seven}
E.~van Herwijnen, N.~A. F.~M. Poppelier, and J.C. Sens.
\newblock Using the electronic manuscript standard for document conversion.
\newblock {\em EPSIG News}, 1(14), 1992.

\bibitem{eight}
E.~van Herwijnen.
\newblock The use of text interchange standards for submitting physics articles
  to journals.
\newblock {\em Comp. Phys. Comm.}, 57:244--250, 1989.

\bibitem{nine}
E.~van Herwijnen and J.C. Sens.
\newblock Streamlining publishing procedures.
\newblock {\em Europhysics News}, pages 171--174, November 1989.

\bibitem{ten}
Standard generalized markup language ({\SGML}).
\newblock Technical Report 8879, {\ISO}, 1986.

\bibitem{eleven}
M.~Abramowitz and I.~Stegun.
\newblock {\em Handbook of mathematical functions}.
\newblock Dover, New York, 1972.

\bibitem{twelve}
I.S. Gradshteyn and I.M. Ryzhik.
\newblock {\em Table of integrals, series, and products}.
\newblock Academic Press, New York, 1980.

\bibitem{thirteen}
S.A. Mamrak, C.S. O'Connell, and J.~Barnes.
\newblock Technical documentation for the integrated chameleon architecture.
\newblock Technical report, March 1992.

\bibitem{fourteen}
Neil~M. Soiffer.
\newblock {\em The design of a user interface for computer algebra systems}.
\newblock PhD thesis, Computer Science Division ({\acro{EECS}}), University of
  California, Berkeley, 1991.
\newblock Report {\acro{UCB}/\acro{CSD}} 91/626.

\bibitem{fifteen}
M.~Nakahara.
\newblock {\em Geometry, Topology and Physics}.
\newblock Adam Hilger, Bristol, 1990.

\bibitem{sixteen}
Dennis~S. Arnon and Sandra~A. Mamrak.
\newblock On the logical structure of mathematical notation.
\newblock {\em \TUB}, 12:479--484, 1991.

\bibitem{seventeen}
Bj{\"{o}}rn von Sydow.
\newblock Private communication to EvH.

\bibitem{eighteen}
Stephen Wolfram.
\newblock {\em Mathematica: a system for doing mathematics by computer}.
\newblock Addison-Wesley, Reading, 1991.

\bibitem{nineteen}
Bruce~W. Char, Keith~O. Geddes, Gaston~H. Gonnet, and Stephen~M. Watt.
\newblock {\em Maple User's Guide}.
\newblock \acro{WATCOM} Publications Ltd., Waterloo, 1985.

\bibitem{twenty}
Donald~E. Knuth.
\newblock {\em The {\TeX}book}.
\newblock Addison-Wesley, Reading, 1984.

\bibitem{twentyone}
Joseph~E. Osanna.
\newblock Nroff/troff.
\newblock In {\em {UNIX} Programmer's Manual (2b)}. Bell Laboratories, 1978.

\bibitem{twentytwo}
Brian~W. Kernighan and Linda Cherry.
\newblock Typesetting mathematics.
\newblock In {\em {UNIX} Programmer's Manual (2b)}. Bell Laboratories, 1978.

\bibitem{twentythree}
E.~van Herwijnen.
\newblock {\em Practical {\SGML}}.
\newblock Kluwer Academic Publishers, Dordrecht, 1990.

\bibitem{twentyfive}
Font information interchange.
\newblock Technical Report 9541, \ISO, 1991.

\bibitem{twentysix}
Information processing -- {\SGML} support facilities -- techniques for using
  {\SGML} -- part 13.
\newblock Technical Report 9573, \ISO, 1991.
\newblock Proposed Draft Technical Report.

\end{thebibliography}

%\begin{tabular}{ll}
%N. A. F. M. Poppelier& E. van Herwijnen, \\
%Elsevier Science Publishers,&CERN,\\
%P.O. Box 2400,&1211-CH,\\
%1000 CK Amsterdam,&Geneva 23,\\ 
%the Netherlands&Switzerland\\
%\texttt{n.poppelier@elsevier.nl}&%???
%\end{tabular}

%\noindent\qquad and\\
%\begin{tabular}{l}
%C.A. Rowley\\\texttt{C.A.Rowley@open.ac.uk}
%\end{tabular}

\end{Article}

\endinput 


A   Existing mathematical notations 
 
A.1  Comparison of existing \dtd{}s 
 
In making comparisons between existing \dtd{}s we shall refer often to what is probably the best-known 
system for coding mathematical notation in documents. This is the version of \TeX{} coding used in 
\LaTeX~[27] (which differs little from Knuth's Plain \TeX{} notation described in [20]), now a de facto 
standard in many areas. It is a mixture of visual and logical tagging, with a bias towards the visual 
which probably results from reasoning similar to that in this paper. 
 
The following document type definitions for mathematical formulas were investigated for this paper: 
AAP~[28], ISO~[29] and Euromath~[5]. 
 
We will try to give a few general characteristics of each of them: 
 
AAP This \dtd{} shows a hybrid of visual and logical tagging. It is quite similar to the mathematical 
notation of \TeX~[20]. 
Integrals, sums and similar constructions have sub-elements tagged explicitly as lower limit, upper limit 
and integrand (summand,...). 
 
The same goes for fractions, roots, and limit-like constructions. 
 
All rectangular schemes of mathematical expressions, e.g.\ matrices and determinants, are tagged as 
`array' in this \dtd{}. The delimiters are not part of the construction, although matrices are usually indicated 
by ( ) or as [ ], and determinants as $|\ |$. Alignment of rows, columns and cells is indicated by attributes, 
even though they have nothing to do with function, but are in fact processing information. This idea 
also appears in the array notation of \LaTeX~[27].
