@c Copyright (C) 1992, 2004 Free Software Foundation.
@c This is part of the GNU font utilities manual.
@c For copying conditions, see the file fontutil.texi.

@node File formats
@chapter File formats

@cindex file formats
@cindex auxiliary files
@cindex data files

These programs use various data files to specify font encodings,
auxiliary information for a font, and other things.  Some of these data
files are distributed in the directory @file{data}; others must be
constructed on a font-by-font basis.

@vindex FONTUTIL_LIB @r{environment variable}
@cindex data file searching
If the environment variable @code{FONTUTIL_LIB} is set, data files are
looked up along the path it specifies, using the same algorithm as is
used for font searching (@pxref{Font searching}).  Otherwise, they are
looked up along a default path, which is set in the top-level Makefile.
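
As a rough illustration only, the lookup might be sketched in Python as
follows.  This is not the programs' actual code: it assumes the path is
a list of directories separated by the platform's path separator, it
ignores any refinements of the font-searching algorithm, and the
@code{default_path} argument is a hypothetical stand-in for the path
compiled in from the Makefile.

@example
```python
import os

def find_data_file(name, default_path="."):
    # Use FONTUTIL_LIB if it is set; otherwise fall back to default_path,
    # a stand-in for the path configured in the top-level Makefile.
    path = os.environ.get("FONTUTIL_LIB", default_path)
    for directory in path.split(os.pathsep):
        # Treating an empty path element as the current directory is
        # an assumption of this sketch.
        candidate = os.path.join(directory or ".", name)
        if os.path.isfile(candidate):
            return candidate
    return None
```
@end example

The function returns the first match along the path, or @code{None} if
no directory contains the file.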

The following sections (in other chapters of the manual) also describe
file formats:

@itemize @bullet

@item
@ref{BZR files}.

@item
@ref{CCC files}.

@item
@ref{CMI files}.

@item
@ref{IFI files}.

@end itemize

@menu
* File format abbreviations::   The alphabet soup of font formats.
* Common file syntax::          Some elements of auxiliary files are constant.
* Encoding files::              The character code-to-shape mapping.
* Coding scheme map file::      The coding scheme string-to-filename mapping.
@end menu


@node File format abbreviations
@section File format abbreviations

@cindex abbreviations, of file formats
@cindex file format abbreviations
@cindex meaning, of file format abbreviations

For the sake of brevity, we do not spell out every abbreviation
(typically of file format names) in the manual every time we use it.
This section collects and defines all the common abbreviations we use.

@table @asis

@cindex BPL abbreviation
@item BPL
The `Bezier property list' format output by BZRto and read by BPLtoBZR.
This is a transliteration of the binary BZR format into human-readable
(and -editable) text.  @xref{BPL files}.

@cindex BZR abbreviation
@item BZR
The `Bezier' outline format output by Limn and read by BZRto.  We
invented this format ourselves.  @xref{BZR files}.

@cindex CCC abbreviation
@item CCC
The `cookie-cutter character' (er, `composite character construction')
files read by BZRto to add pre-accented and other such characters to a
font.  @xref{CCC files}.

@cindex CMI abbreviation
@item CMI
The `character metric information' files read by Charspace to add side
bearings to a font.  @xref{CMI files}.

@cindex GF abbreviation
@item GF
The `generic font' bitmap format output by Metafont (and by most of
these programs).  See the sources for Metafont or one of the other
@TeX{} font utility programs (GFtoPK, etc.) for the definition.

@cindex DVI abbreviation
@item DVI
The `device independent' format output by @TeX{}, GFtoDVI, etc.  Many
``DVI driver'' programs have been written to translate DVI format to
something that can actually be printed or previewed.  See the sources
for @TeX{} or DVItype for the definition.

@cindex EPS abbreviation
@cindex advertising
@item EPS
The `Encapsulated PostScript' format output by many programs, including
Imageto (@pxref{Viewing an image}) and Fontconvert (@pxref{Fontconvert
output options}).  An EPS file differs from a plain PostScript file in
that it contains information about the PostScript image it produces: its
bounding box, for example.  (This information is contained in comments,
since PostScript has no good way to express such information directly.)

@cindex IFI abbreviation
@item IFI
The `image font information' files read by Imageto when making a font
from an image.  @xref{IFI files}.

@cindex GSF abbreviation
@pindex bdftops
@cindex Ghostscript font format
@item GSF
The `Ghostscript font' format output by BZRto and the @file{bdftops}
program in the Ghostscript distribution.  This is nothing more than the
Adobe Type 1 font format, unencrypted.  The Adobe Type 1 format is
defined in a book published by Adobe.  (Many PostScript interpreters
cannot read unencrypted Type 1 fonts, despite the fact that the
definition says encryption is not required.  Ghostscript can read both
encrypted and unencrypted Type 1 fonts.)

@cindex IMG abbreviation
@cindex GEM
@cindex DOS image format
@item IMG
The `image' format used by some GEM (a window system sometimes used
under DOS) programs; specifically, by the program which drives our
scanner.

@cindex MF abbreviation
@cindex Knuth, Donald E.
@item MF
The `Meta-Font' programming language for designing typefaces invented by
Donald Knuth.  His @cite{The METAFONTbook} is the only manual written to
date (that we know of).

@cindex PBM abbreviation
@cindex Poskanzer, Jef
@item PBM
The `portable bitmap' format used by the PBMplus programs,
Ghostscript, Imageto, etc.  It was invented by Jef Poskanzer (we
believe), the author of PBMplus.

@cindex PFA abbreviation
@item PFA
The `printer font ASCII' format in which Type 1 PostScript fonts are
sometimes distributed.  This format uses the ASCII hexadecimal
characters @samp{0} to @samp{9} and @samp{a} to @samp{f} (and/or
@samp{A} to @samp{F}) to represent an @code{eexec}-encrypted Type 1
font.

@cindex PFB abbreviation
@item PFB
The `printer font binary' format in which Type 1 PostScript fonts are
sometimes distributed.  This format is most commonly used on DOS
systems.  (Personally, we find the existence of this format truly
despicable, as one of the major advantages of PostScript is its being
defined entirely in terms of plain text files (in Level 1 PostScript,
anyway).  Having an unportable binary font format completely defeats
this.)

@cindex PK abbreviation
@cindex Rokicki, Tom
@pindex gftopk
@item PK
The `packed font' bitmap format output by GFtoPK.  PK format has (for
all practical purposes) the same information as GF format, and does a
better job of packing: typically a font in PK format will be one-half to
two-thirds of the size of the same font in GF format.  It was invented
by Tom Rokicki as part of the @TeX{} project.  See the GFtoPK source for
the definition.

@cindex PL abbreviation
@cindex property list format
@cindex editing TFM files
@pindex tftopl
@item PL
The `property list' format output by TFtoPL.  This is a transliteration
of the binary TFM format into human-readable (and -editable) text.  Some
of these programs output a PL file and call PLtoTF to make a TFM from
it.  (For technical reasons it is easier to do this than to output a TFM
file directly.)  See the PLtoTF source for the details.

@cindex TFM abbreviation
@cindex @TeX{} font metric format
@cindex font metrics
@pindex pltotf
@item TFM
The `@TeX{} font metric' format output by Metafont, PLtoTF, and other
programs, and read by @TeX{}.  TFM files include only character
dimension information (widths, heights, depths, and italic corrections),
kerns, ligatures, and font parameters; in particular, there is no
information about the character shapes.  See the @TeX{} or Metafont
source for the definition.

@end table


@node Common file syntax
@section Common file syntax

@cindex syntax, common data file
@cindex common data file syntax
@cindex file syntax, common
@cindex data file syntax, common

Data files read by these programs are text files that share certain
syntax elements:

@itemize @bullet

@cindex comments in data files
@cindex data files, comments in
@item
Comments begin with a @samp{%} character and continue to the end of the
line.  The content of comments is entirely ignored.

@cindex blank lines in data files
@cindex data files, blank lines in
@cindex whitespace characters in data files
@cindex data files, whitespace characters in 
@findex isspace
@item
Blank lines are allowed, and ignored.  Whitespace characters (as
defined by the C facility @code{isspace}) are ignored at the beginning of
a line.

@cindex null byte in data files
@cindex ASCII NUL in data files
@cindex data files, valid characters in
@item
Any character except ASCII NUL---character zero---is acceptable in data
files.  (We would allow NULs, too, at the expense of complicating the
code, if we knew of any useful purpose for them.)

@end itemize

@cindex line length in data files
A line can be as long as you want.
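
The rules above can be sketched as a small Python filter.  This is only
an illustration of the syntax, not the code the programs use; the
function name @code{significant_lines} is our own, and removing trailing
whitespace is a simplification the rules above do not require.

@example
```python
def significant_lines(text):
    """Yield the lines of a data file that carry content."""
    for line in text.split("\n"):
        # A comment starts at the first % and runs to the end of the line.
        line = line.split("%", 1)[0]
        # Leading whitespace is ignored; also removing trailing
        # whitespace is a simplification made by this sketch.
        line = line.strip()
        # Blank lines, and lines holding only a comment, are skipped.
        if line:
            yield line
```
@end example

Because the text is handled one whole line at a time, no limit on line
length is imposed.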


@node Encoding files
@section Encoding files

@cindex encoding files
@cindex font encoding, files for
@cindex character mapping
@cindex mapping, of characters

The @dfn{encoding} of a font specifies the mapping from character codes
(integers, typically between zero and 255) to the characters
themselves; e.g., does a character with code 92 wind up printing as a
backslash (as it does under the ASCII encoding) or as a double left
quote (as it does under the most common @TeX{} font encoding)?  Put
another way, the encoding is the arrangement of the characters in the
font.

It is sad but true that no single encoding has been widely adopted, even
for basic text fonts.  (Text fonts and, say, math fonts or symbol fonts
will clearly have different encodings.)  Every typesetting program
and/or font source seems to come up with a new encoding; GNU is no
exception (@pxref{GNU encodings}).  Therefore, when you decide on the
encoding for the fonts you create, you should choose whatever is most
convenient for the typesetting programs you intend to use them with.
(Decent typesetting systems would make it trivial to set font encodings;
unfortunately, almost nothing is decent in that regard!)

@flindex .enc @r{suffix}
The @dfn{encoding file} format we invented is a font-format-independent
representation of an encoding.  Encoding files are ``data files'' which
have the basic syntax elements described above (@pxref{Common file
syntax}).  They are usually named with the extension @code{.enc}.

The first nonblank non-comment line in an encoding file is a string to
put into TFM files as the ``coding scheme'' to describe the encoding;
some common coding schemes are @samp{TeX text}, @samp{TeX math symbol},
@samp{Adobe standard}.  Case is irrelevant; that is, any programs which
use the coding scheme should pay no attention to its case.

Thereafter, each nonblank non-comment line defines the character for the
corresponding code: the first such line defines the character with code
zero, the next with code one, and so on.

Each character consists of a name, optionally followed by ligature
information.  (All fonts using the same encoding should have the same
ligatures, it seems to us.)
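
To make the layout concrete, here is a minimal Python sketch of a reader
for this format.  It is an illustration under our own assumptions, not
the parser the programs use: the name @code{parse_encoding} is
hypothetical, and the ligature information is returned as an unparsed
string.

@example
```python
def parse_encoding(text):
    """Return (coding_scheme, entries); entries[code] is (name, rest)."""
    lines = []
    for raw in text.split("\n"):
        # Drop comments and surrounding whitespace; skip blank lines.
        line = raw.split("%", 1)[0].strip()
        if line:
            lines.append(line)
    if not lines:
        raise ValueError("no coding scheme line found")
    # The first significant line is the coding scheme string.
    scheme = lines[0]
    entries = []
    # Each later line defines the next character code, starting at zero.
    for line in lines[1:]:
        fields = line.split(None, 1)
        name = fields[0]
        rest = fields[1] if len(fields) > 1 else ""
        entries.append((name, rest))
    return scheme, entries
```
@end example

The index of an entry in the returned list is its character code, which
mirrors the way the file assigns codes by position.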

@menu
* Character names::             How to write character names.
* Ligature definitions::        How to define ligatures.
* GNU encodings::               Why we invented new encodings for GNU.
@end menu


@node Character names
@subsection Character names

@cindex character names
@cindex names of characters

The @dfn{character name} in an encoding file is an arbitrary sequence of
nonblank characters (except it can't include a @code{%}, since that
starts a comment).  Conventionally, it consists of only lowercase
letters, except where an uppercase letter is actually involved.  (For
example, @code{eacute} is a lowercase @code{e} with an acute accent;
@code{Eacute} is an uppercase @code{E} with an acute accent.)

@vindex .notdef
@cindex blank positions in fonts
@cindex undefined characters in fonts
@cindex fonts, undefined characters in
If a character code has no equivalent character in the font, i.e., the
font table has a ``blank spot'', you should use the name @code{.notdef}
for that code.  This is the only name you can usefully give more than
once.  If any other name is used more than once, the results are
undefined.

To avoid unnecessary proliferation of character names, you should use
names from existing @file{.enc} files where possible.  All the
@file{.enc} files we have created are distributed in the @file{data}
directory.


@node Ligature definitions
@subsection Ligature definitions

@cindex ligature definitions
@cindex defining ligatures
@cindex encoding, ligatures in

The ligature information for a character in an encoding file is
optional.  More than one ligature specification may be given.  Each
specification looks like:

@example
lig @var{second-char} =: @var{lig-char}
@end example

This means that a ligature character @var{lig-char} should be present in
the font for the current character (the one being defined on this line
of the encoding file) followed by @var{second-char}.  You give
@var{second-char} and @var{lig-char} as character codes
(@pxref{Specifying character codes}).  For example, in most text
encodings (which involve Latin characters), some variation on the
following line will be present:

@example
f       lig f =: 013  lig i =: 014  lig l =: 015
@end example

This will produce a ligature in the font such that when a typesetting
program sees the two-character sequence @samp{ff} in the input, it
replaces those two characters in the output with the single character at
position octal 13 (presumably the `ff' ligature) of the font; when it
sees @samp{fi}, the character at position octal 14 is output; when
it sees @samp{fl}, the character at position octal 15 is output.
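
As an illustration, the ligature part of such a line (everything after
the character name) could be split into pairs with a Python sketch like
the one below.  The function name @code{parse_ligatures} is our own, and
the sketch deliberately leaves @var{second-char} and @var{lig-char} as
strings, since the full syntax of character codes is described elsewhere
(@pxref{Specifying character codes}).

@example
```python
def parse_ligatures(spec):
    """Parse 'lig X =: Y lig ...' into a list of (second_char, lig_char)."""
    tokens = spec.split()
    # Every specification consists of exactly four tokens.
    if len(tokens) % 4 != 0:
        raise ValueError("incomplete ligature specification")
    pairs = []
    for i in range(0, len(tokens), 4):
        keyword, second, arrow, lig = tokens[i:i + 4]
        if keyword != "lig" or arrow != "=:":
            raise ValueError("expected: lig SECOND-CHAR =: LIG-CHAR")
        # Codes stay as written; interpreting them is left to the caller.
        pairs.append((second, lig))
    return pairs
```
@end example

Applied to the example line above, this yields three pairs, one each for
the second characters @samp{f}, @samp{i}, and @samp{l}.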

@cindex ligatures in Metafont 2
Metafont version 2 allows a more general ligature scheme; if there is a
demand for it, support wouldn't be hard to add.


@node GNU encodings
@subsection GNU encodings

@cindex encodings, GNU
@cindex GNU encodings

@cindex Cork encoding
@cindex PostScript encodings
@cindex Adobe encodings
@cindex @TeX{} encodings
When we started making fonts for the GNU project, we had to decide on
some font encoding.  We hoped to use an existing one, but none that we
found seemed suitable: the @TeX{} font encodings, including the ``Cork
encoding'' described in TUGboat 11#4, lacked many standard PostScript
characters; conversely, the standard PostScript encodings lacked useful
@TeX{} characters.  Since we knew that Ghostscript and @TeX{} would be
the two main applications using the fonts, we thought it unacceptable to
favor one at the expense of the other.

@flindex gnulatin.enc
Therefore, we invented two new encodings.  The first one, ``GNU Latin
text'' (distributed in @file{data/gnulatin.enc}), is based on ISO Latin
1, and is close to a superset of both the basic @TeX{} text encoding and
the Adobe standard text encoding.  We felt it was best to use ISO Latin
1 as the foundation, since some existing systems actually use ISO Latin
1 instead of ASCII.  We also left the first eight positions open, so
particular fonts could add more ligatures or other unusual characters.

@cindex expert encoding
The second, ``GNU Latin text complement'' (distributed in
@file{data/gnulcomp.enc}), includes the remaining pre-accented
characters from the Cork encoding, the PostScript expert encoding, swash
characters, small caps, etc.

@c todo discuss encodings in more detail, show tables, etc.


@node Coding scheme map file
@section Coding scheme map file

@cindex coding scheme mapping
@cindex finding encodings

When a program reads a TFM file, it's given an arbitrary string (at
best) for the coding scheme.  To be useful, it needs to find the
corresponding encoding file.  We couldn't think of any way to name our
@file{.enc} files that would allow the filename to be guessed
automatically.  Therefore, we invented another data file which maps the
TFM coding scheme strings to our @file{.enc} filenames.

@flindex encoding.map
This file is distributed as @file{data/encoding.map}.  @xref{Common file
syntax}, for a description of the common syntax elements.

Each nonblank non-comment line in @file{encoding.map} has two entries:
the first word (contiguous nonblank characters) is the @file{.enc}
filename; the rest of the line, after skipping the intervening
whitespace, is the string in the TFM file.  This should be the same
string that appears on the first line of the @file{.enc} file
(@pxref{Encoding files}).

Programs should ignore case when using the coding scheme string.

Here is the coding scheme map file we distribute:

@example
adobestd        Adobe standard
ascii           ASCII
dvips           dvips
dvips           TeX text + adobestandardencoding
gnulatin        GNU Latin text
gnulcomp        GNU Latin text complement
psymbol         PostScript Symbol
texlatin        Extended TeX Latin
textext         TeX text
zdingbat        Zapf Dingbats
@end example
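
A lookup based on this file might be sketched in Python as follows.
This is an illustration under our own assumptions rather than the
programs' code: the names @code{parse_encoding_map} and
@code{encoding_file_for} are hypothetical, the filename is returned
exactly as written in the map, and lines without a coding scheme are
silently skipped.

@example
```python
def parse_encoding_map(text):
    """Map lowercased coding scheme strings to filenames from the map."""
    schemes = dict()
    for raw in text.split("\n"):
        # Drop comments and surrounding whitespace; skip blank lines.
        line = raw.split("%", 1)[0].strip()
        if not line:
            continue
        # First word is the filename; the rest is the coding scheme.
        fields = line.split(None, 1)
        if len(fields) < 2:
            continue  # no coding scheme given; skipping is our choice
        filename, scheme = fields
        # Case is ignored, so store the scheme in lowercase.
        schemes[scheme.strip().lower()] = filename
    return schemes

def encoding_file_for(schemes, coding_scheme):
    """Look up a TFM coding scheme string, ignoring case."""
    return schemes.get(coding_scheme.strip().lower())
```
@end example

With the map shown above, looking up @samp{TEX TEXT} or @samp{TeX text}
would both give @samp{textext}, and an unknown coding scheme gives
@code{None}.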