FIELD MARKS FOR WEBSTER 1913 and CIDE
=====================================
* Overview
This file describes the tags used to mark the Webster 1913 dictionary and
the GCIDE (GNU Collaborative International Dictionary of English).
If any tag is not listed here, it is either (1) one of the "point" (font
size) or "type" (font style) tags, which should be self-explanatory; or (2)
a functional field with no effect on the typography.
Last modified March 12, 1999.
For questions, contact:
Patrick Cassidy cassidy@micra.com
735 Belvidere Ave.
Plainfield, NJ 07062
(908) 561-3416 or (908) 668-5252
A separate file, webfont.txt, contains the list of the individual
non-ASCII characters represented by either higher-order hexadecimal
character marks (e.g., \'94, for o-umlaut) or by entity tags.
The tags on this list are similar in structure to SGML tags. Each tag on
this list marks a field; each field opens with a tagname between angle
brackets thus: <tagname>, and closes with a similar tag containing the
forward slash thus: </tagname>. No tags are used without closing tags.
Thus a line break (similar to HTML tag) is symbolized here as an
entity, has a corresponding
.
The absence of an end-field tag, or the presence of an end-field tag without
a prior begin-field tag constitutes a typographical error, of which there
may be a significant number. Any errors detected should be brought to the
attention of PJC or the appropriate editor.
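The pairing rule above lends itself to a mechanical check. The following
Python sketch is not part of the original tooling; it assumes that tag names
are purely alphanumeric and that comments are written as <-- ... -->. The
function name find_unbalanced_tags is hypothetical.

```python
import re

# Assumption: a tag is <name> or </name> with an alphanumeric name.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>")
# Assumption: comments look like <-- ... --> and are not tags.
COMMENT_RE = re.compile(r"<--.*?-->", re.DOTALL)


def find_unbalanced_tags(text):
    """Return (kind, tagname, offset) for every unmatched tag, in text order."""
    # Blank out comments but keep their length so offsets stay valid.
    text = COMMENT_RE.sub(lambda m: " " * len(m.group(0)), text)
    stack = []      # open tags seen so far: (name, offset)
    problems = []
    for m in TAG_RE.finditer(text):
        is_close, name = m.group(1) == "/", m.group(2)
        if not is_close:
            stack.append((name, m.start()))
            continue
        # Look for the nearest open tag with the same name.
        for i in range(len(stack) - 1, -1, -1):
            if stack[i][0] == name:
                # Fields opened after it were never closed.
                for inner_name, pos in stack[i + 1:]:
                    problems.append(("missing-close", inner_name, pos))
                del stack[i:]
                break
        else:
            problems.append(("missing-open", name, m.start()))
    # Anything still open at the end has no end-field tag.
    problems.extend(("missing-close", n, p) for n, p in stack)
    return sorted(problems, key=lambda item: item[2])
```

Each reported offset points at the offending tag, so a proofreader can jump
straight to the suspect field.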
Most of the tagged fields are presented in the text in italic type, with a
number of exceptions. Where a word is contained within more than one field,
the innermost field determines the font to be used. Wherever recognizable
functional fields were found, an attempt was made to tag the field with a
functional mark, but in many cases, words were italicised only to represent
the word itself as a discourse entity, and in some such cases, the "italic"
mark was used, implying nothing regarding functionality of the word.
The base font is considered "plain". Where an italic field is indicated,
parentheses or brackets within the field are not italicised.
Where no font is specified for a tag, the tag is merely a functional
division, and was printed in plain font unless otherwise tagged. This type
of segment is marked by an asterisk (*) where the font name would be. The
size of the "plain" font in the original text is about 1.6 mm for the height
of capitalized letters.
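The font rules above can be sketched in Python as follows. This is one
possible reading, not a definitive implementation: it assumes that a purely
functional tag (marked * in the tables) lets the font of the next enclosing
field show through, and that text outside any font-bearing field is plain.
The table holds only a few tag names mentioned in this file's descriptions
(hw, conjf, decf, cs); the function name effective_font is hypothetical.

```python
# Illustrative subset of the tag tables. None marks a purely functional tag,
# i.e. one with no font of its own.
FONTS = {
    "hw": "bold",
    "conjf": "small caps",
    "decf": "small caps",
    "cs": None,
}


def effective_font(open_tags, fonts=FONTS):
    """Return the font for text inside the given fields.

    open_tags lists the enclosing fields, outermost first.
    """
    # Walk from the innermost field outward; the first tag that carries
    # a font decides. Unknown tags are treated as functional (assumption).
    for tag in reversed(open_tags):
        font = fonts.get(tag)
        if font is not None:
            return font
    return "plain"
```

For example, effective_font(["cs", "conjf"]) yields "small caps", while
effective_font(["cs"]) falls back to "plain".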
* Explicit typographical tags
These were used where the purpose of a different font was merely to
distinguish a word from the body of the text, and no explicit functional tag
seemed appropriate.
-------------------------------------------------------------------------
Tag Font Description
-------------------------------------------------------------------------
plain font that used in the body of a definition -- normally
not marked, except within fields of a different
font.
italic in master files
italic for use in HTML presentation
bold in master files
bold for use in HTML presentation
bold, Collocation font. Same font as used in
collocations.
smaller This is used only in the list of "un-"
by 1 point words not actually defined in the
dictionary.
Probably could be replaced by a segment mark
for the entire list! The "un-" words should
be indexed as headwords.
bold Same as , a font similar to that used
in collocations. However, this tag is used
in a table and could be set to a different
font.
* HTML tag -- largest heading font.
* HTML tag -- second largest heading font.
* Marks a Row title in a table.
Font the same as the headword , though
the field is not a headword. Used only
once.
* Multiple items, a set of items in a table.
A series of point size markers, many
unique.
* One of the tags of the form where **
represents the typographic point size of the
enclosed text.
An HTML tag indicating that the enclosed
text is of teletype form, preformatted in a
uniform-spaced font.
small caps used mostly for "a. d.", "b. c."
This is the same font as in , but has no
functional or semantic significance.
group of table data elements in a table.
subscript
subscript
superscript
superscript
Sans-serif
Bold collocation font, and also a subtype.
HTML tag -- teletype font
A squared bold font without serifs approximating
the "universe bold" font on the HP Laserjet4,
slightly larger than the capitals in a definition
body. Used in expositions describing shapes,
such as "Y", "T", "U", "X", "V", "F".
Vertically organized column.
Vertically organized column -- only part of a table
which needs to be completed. Used once.
<...type> A series of tags, many unique, designating
certain unusual fonts, such as "bourgeoistype"
for "bourgeois type", in the section on
typography. Most of these occur only once, in
the section on fonts. Some examples follow:
* Tags with semantic content:
-------------------------------------------------------------------------
Tag Font Meaning and Description
-------------------------------------------------------------------------
* Alternative spelling segment. Almost always
contained within square brackets after the main
definition segment. Expository words such as
"Spelled also" are in plain font; the actual
alternative spelling is marked by ...
tags within this segment.
italic Antonym.
italic Alternative spelling. The actual word which is
an alternative spelling to the headword. These
are functionally synonyms of the headword. In
most cases these also occur as headwords, with
reference to the word where the actual definition
is found, but not all such words are listed
separately, particularly if the spelling is close
enough to the headword to be found at the same
point in the dictionary. Whether listed
separately or not, these words should be indexed
at this location, also.
italic Authority or author. Used where an authority is
given for a definition, and also used for the
author, where a quotation within double quotes is
given in the same paragraph as the definition.
The double quotes are indicated by the open-quote
(\'bd) and close-quote (\'b8). In both cases, it
is typically right-justified, almost always
fitting on the same line with the last line of
the definition or quotation.
Within collocation segments, it is usually used
only after quotations, and is not
right-justified, except occasionally where it
would be close to the right margin, and then
apparently is right-justified. We have not
explicitly marked those which are
right-justified, but they can be recognized
because they are on a line by themselves,
preceded by two carriage returns.
* Marks a biography. Should be longer than a short
mention of who a person was, which is typically
included as a definition.
* Same as
italic Marks the name of a book, pamphlet, or similar
document.
* A field of knowledge of which the headword
is a division.
* Caption of a figure or table.
* tags the CAS (Chemical Abstracts Service)
registry number for a chemical substance.
italic tags the infectious disease caused by the
headword. Implied type of the agent is a
microorganism, and the tag must mark a disease.
* Same as without the italic type.
* Same as without the italic type.
italic inverse of : tags the causative agent of
an infectious disease, which is the headword.
The tag must mark a microorganism, virus, or
prion, and the implied type of the headword is a
disease.
Used only for the single letter in the headers to
each letter of the alphabet.
* marks the proper name of a city. Used only
occasionally and not consistently at this stage.
italic Converted to: used to tag substances which are
products prepared by conversion from the
headword. Usually chemicals or complex products
from natural materials. Rarely used up to 1998.
* List of heads for the columns of a table.
* Title of a column in a table.
* Comment -- differs from in being in-line
with the definition paragraph. Provides a little
additional information.
* Name of a company (commercial firm). Compare
.
italic Composed of. Tags a substance of which the
headword is at least partly composed. The
substance may be particulate, such as diatoms
composing diatomaceous earth.
* marks an object contained within the headword.
italic Contrasting word. Not exactly an antonym, which
is marked , but a contrasting word which is
often introduced as "opposite to" or "contrasts
with".
* Name of a country (nation) of the world.
italic Collocation reference. A reference to a
collocation. Each such collocation should have
its own entry, marked by
... tags,
and these references should function as hypertext
buttons to access that entry.
* A Date, of any type, e.g. Dec. 25.
* Date-with-year tags a date containing a year.
* A definition. The definition may have subfields,
particularly (an illustrative phrase
starting with "as" or "thus" and containing the
headword or a morphological derivative). The
, \'bd...\'b8 quotations (left and right
double quotes) and fields may be found
within a definition field, but should be, and
usually are, located outside the definition proper. The
marking macro was inconsistent in this placement,
and the exclusion of the , and
quotations needs to be completed by the
proof-readers.
Certain definitions contain fields within
them, where the headword is an irregular
derivative of another headword. In these cases,
the field follows immediately after the
tag, and these entries do not have a
separate field. In such cases, the
field is italic, as usual.
* Division of the headword, usually an
organization. E. g. a faculty or department of a
university, or a United Nations agency.
* Marks an education institution, a subtype of
organization.
* Tags a physical object or form of radiation
emitted by the headword.
Just a place-holder for illustrations, but seldom
used.
italic Marks the name of a movie film.
italic Field of specialization. Most often used for
Zoology and Botany, but many "fields of
specialization" are marked for technical terms.
The parentheses are usually within this field,
but are not themselves in italics.
* Name of a geographical region of any size; if
applicable, the more specific , , or
are preferred.
* Hypernym. Points to the hypernym from WordNet 1.5.
Initially, used only for entries extracted from
WordNet 1.5. Not present in the original 1913
version.
* Illustrative usage -- mostly from WordNet, and
placed outside the definition, in contrast to
usage. These should be converted to
... illustrative usage format for
consistency.
* Illustration place-holder. Seldom used.
* HTML usage -- points to an image file, usually
.gif or .jpg. These have no closing tag, and
will appear as errors in parsing.
* Points to a word whose meaning is an intensified
form of the headword. Taken from WordNet tags,
used with some adjectives from WordNet.
* Designates one item in a row of a table. Used
only when intervening spaces do not serve
properly as natural field separators.
italic Translation into a foreign (non-English) language
of the previous word in the text -- italic font.
( is a translation into English)
italic Same as
* Title of a journal (periodical).
* Always a filled rectangular array.
* A 2x5 matrix (2 rows by 5 columns).
* Multiple synonymous subtypes -- used in def. of
"grass".
* Multiple table, encloses
figures.
* Music figure. Only in a note under the entry
"Figure", the two numbers of each such field are
bold, 20 point type, stacked as in a fraction
with a bar between them, but also having a
horizontal stroke midway through each
numeral. Unique to this entry.
* Paragraph tag, used always in pairs. Line breaks
may be embedded inside the paragraphs.
* Marks the proper name of a person. Used only
occasionally, but should be used more frequently
for cases where first names are abbreviated, to
reduce ambiguity of the period for automatic
analysis. Where a title is given, prefixed or
postfixed, it is included in this tag.
* Marks the name of a person, when only one name
(usually the last name) is given. Not used
consistently where it should be.
* Marks the name of a publication other than book,
which is marked by . It is often a
magazine or journal.
* Tags the name of a person who is speaking, within
a quotation.
Same as
* Collocation, plain text -- used to tag phrases
that should be parsed as a unit, but has no
typographical significance.
italic Always right-justified, as described for .
* A reference to a word in the vocabulary.
* Marks the set of references used for a longer
article such as a biography.
* Marks the name of a river -- a proper name.
* Right justified.
* Designates a row in a table.
* Name of a geopolitical state, the first
subdivision of a country. Includes, e.g. Canadian
provinces.
* Lists subtypes of the headword.
* Superscript
* Supra. The two parts of each such field are
stacked, one over the other, *without* a
horizontal bar between (as in a fraction). Used
only in one entry, for a musical notation.
* Always a filled rectangular array, having
and elements.
* Table datum - one cell in a table.
* Table header.
* Tags a commercial Trade name.
* Table title (Larger than normal font).
====================================================================
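The descriptions above mention three hexadecimal character marks: \'94 for
o-umlaut, \'bd for the open double quote, and \'b8 for the close double
quote. As a minimal sketch, the Python below decodes just those three; the
full list lives in webfont.txt and is not reproduced here. The names MARKS
and decode_marks are hypothetical, and unknown marks are deliberately left
untouched.

```python
import re

# Only the marks named in this file; webfont.txt has the complete list.
MARKS = {
    "94": "\u00f6",  # o-umlaut
    "bd": "\u201c",  # open double quote
    "b8": "\u201d",  # close double quote
}

# A character mark is a backslash, an apostrophe, and two hex digits.
MARK_RE = re.compile(r"\\'([0-9a-fA-F]{2})")


def decode_marks(text, marks=MARKS):
    """Replace known character marks; leave unknown ones as written."""
    def replace(match):
        return marks.get(match.group(1).lower(), match.group(0))
    return MARK_RE.sub(replace, text)
```

Leaving unknown marks in place keeps the conversion lossless, so that marks
missing from the table can still be found and added later.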
* Functional Tags
In the table below, font size comparatives are relative to the plain font.
-------------------------------------------------------------------------
Tag Font Meaning and Description
-------------------------------------------------------------------------
<-- --> * Comment, not a tag. These segments should be
deleted from the written or printed text. Page
numbers of the original text are indicated within
such comments; these may be left in, if desired.
* A comment. Used to indicate page numbers in the
public domain version.
italic Tag for abbreviations, when mentioned within the
definition text.
small caps Tags for the actual adjective or adverb
comparatives or superlatives. Should be
indexed. See also conjf (verbs) and decf (nouns).
italic Alternative name. Usually for plants or animals,
but also used for other cases where words are
introduced by "also called", "called also",
"formerly called". These are
functionally *synonyms* for that word-sense.
italic Same as , but the marked word is a
plural form, whereas the headword is singular.
* Adjective morphological segment, primarily the
comparative and superlative forms. The
occasional adverb morphology is also tagged this
way.
* A segment occurring within the definitional
sentence, providing an example of usage of the
headword. Not conceptually a part of the actual
definition.
smaller Collocation definition. Similar in structure to
spacing headword definitions (the field). May
contain an field. Plain type, but with
closer spacing than main definitions.
bold, Collocation. A word combination containing the
smaller by headword (or a morphological derivative).
1 point The collocations do not have an explicitly
marked part of speech.
See also , tagging embedded collocations.
Collocation, no typographic significance. Used
to mark a word combination defined in the
dictionary without effect on font.
small caps The conjugated (non-infinitive) forms of verbs.
imp. & p. p. is common, as well as p. pr. &
vb. n. Irregular variants of these are less
common. Words in this field perhaps should be
indexed.
smaller Collocation segment. The font and size are normal
vertical in a cs, but the spacing between lines is smaller
spacing (0.9 mm between lower-case letters, rather than
1.1 mm in the main body of the definition). For
an on-line dictionary, reproducing this
typography is probably pointless.
small caps Declension form. The actual morphological
variants of nouns or pronouns. Should be
indexed.
* Embedded Collocation. A word combination
containing the headword (or a morphological
derivative), embedded within a definition without
a separate definition of its own. These
collocations should be defined implicitly by the
text of the definition in which they are
embedded. See also
, tagging explicitly
defined collocations.
Bold Entry field. Gives the headword without accent or
syllabication marks, and with special-character
symbols converted to their nearest ASCII
equivalents. Can be used without conversion as
the string that serves as the index word for that
entry.
small caps Entry reference. References to headwords within
the "etymology" section are in small caps. Such
references also occur in the body of definitions,
and in "usage" segments. Such entry references
should function as hypertext buttons to access
that entry.
* Etymology. Always contained within square
brackets. Normal type is used for explanatory
comments, and italics for the actual words
(marked ) considered as etymological
sources.
italic Etymological source. Words from which the
headword was derived, or to which it is related.
The Greek words within an etymology segment are
invariably etymology sources, and should be
marked as such, but are not so marked, even in
the rare cases where the Greek word
transliteration has been written in.
italic Etymological source, being the name of a person
or geographical location which is the eponym for
the concept. This is used to distinguish
eponymous etymologies from others, and can also
be found in the body of a definition or note, not
only in the etymology field. Very few of the
names that should be marked this way have
actually been so marked, as of version 0.51. In
cases where such eponymous names have not yet
been thus marked, they will usually be marked by
, the non-semantic italic-font marker, or,
in etymologies, by .
italic Example. An example of usage of the headword,
usually found within an or segment.
* Frequency of use, ordinal rank. This is used for
WordNet entries, in which the synonyms were
ranked in order of frequency of use. 1
indicates that the headword is the first word on
the list of synonyms.
* First use. A date at or around which the first
use of this word in writing is recorded. Not in
the original 1913 Webster, and usu. taken from a
recent dictionary. Only a few such fields have
been entered as of version 0.41.
Greek transliteration. The Greek words have been
transliterated using roman letters. See
chapter "Greek transliteration" in file
"webfont.txt".
bold, A headword. Each main entry begins with the
larger by mark, and ends at the next mark. The main
2 points entries are not otherwise explicitly marked as a
distinctive field. The same word may appear as a
headword several times, usually as different
parts of speech, but sometimes with different
entries as the same part of speech, presumably to
indicate a different etymology. Within the hw
field the heavy accent is represented by double
quote ("), the light accent by open-single-quote
(`), and the short dash separating syllables by
an asterisk (*). A hyphen (-) is used to
represent the hyphen of hyphenated words.
italic, Usage mark. Almost always within square
but brackets, occasionally in parentheses or without
explanatory any bracketing. The most common usage marks,
may be "Obs." = obsolete, "R." = rare, "Colloq." =
plain. colloquial, "Prov. Eng." = Provincial England,
etc. are in italics. Some usage notes are also
marked with , but are in plain. For
simplicity, all words in this field may be
italic, until additional explicit marks are
added.
* A usage mark in plain type (not italic). Found
within a definition, when there is more than one
sense-number listed. "Fig." at the head of an
entry is the most common case.
* Multiple collocation. Similar to multiple
headword, when two or more collocations share one
definition; however, the two collocations are
in-line, rather than stacked or justified. There
may be "or" or "and" words (italicised), or an
"etc." (plain type) within this field. In many
cases, the
* Multiple headword. This field is used where more
than one headword shares a single definition. In
the dictionary, the (usually) two headwords are
left-justified one below the other in the column,
and are tied together on the right side of the
headwords by a long right curly brace. This
division is strictly functional, for analytical
purposes, and does not affect the typography.
* Noun morphology section. Rarely used, mostly for
irregular personal pronouns.
* Explanatory note. No explicit font is indicated.
These segments may be separate, as in the
separate paragraphs starting
* Plural. The "plural" segment starts with a "pl."
which is italicised, but in this segment is not
otherwise marked as italicised. Other words
occurring in this segment are plain type. The
"pl." can be easily explicitly marked if
necessary.
italic Part of speech. Always an abbreviation: e.g.,
n.; v. i.; v. t.; a.; adv.; pron.; prep.
Combinations may occur, as "a. & n.".
* Part of speech, referring to words in
etymologies, normal type. Always an
abbreviation, as above. Combinations may
occur, as "a. or n.".
small caps Plural word. The actual plural form of the word,
found within a segment.
* Pronunciation. The default font is normal, but
many non-ASCII characters are used. The
pronunciation field may have more than one
pronunciation, separated by an "
smaller by Quotation. No bracketing quotation marks, though
two points, occasionally \'bd-\'b8 quotations occur within
centered, these quotations. These quotations tend to be
Separate more complete sentences, rather than just
paragraph phrases, such as are contained within quotation
marks within the definition paragraph.
italic, Quotation author. Used only for the
right quotations marked with that are centered in
justified their own paragraphs.
italic Quotation example. An example of usage of the
headword, within quotations marked by ..
tags.
italic Subdefinition, marked (a), (b), (c), etc. These
are finer distinctions of word senses, used
within numbered word-senses (for main entries),
and also used for subdefinitions within
collocation segments, which have no numbering of
senses. The letter is italic, the parentheses
are not. This tag is also used to indicate the
lettered subdefinition when it is referred to at
another point in the text.
italic The name of a ship. Rarely used.
* Singular. Analogous to the segment, but
more rarely used, mostly for Indian tribes, which
are listed in the plural form.
small caps Singular word. The singular form of the
plural-form headword.
bold, Sense number. A headword may have over 20
larger by different sense numbers. Within each numbered
2 points sense there may be lettered sub-senses. See the
(sub-definition) field.
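The headword description above states that, within the hw field, a double
quote (") marks the heavy accent, an open-single-quote (`) the light accent,
and an asterisk (*) the syllable break, while a hyphen (-) is a real hyphen.
A minimal Python sketch for recovering the bare word from such a field
follows. The function name strip_hw_marks and the sample inputs are
illustrative only, and the conversion of special-character symbols to ASCII
is not attempted.

```python
def strip_hw_marks(hw_text):
    """Remove accent and syllable marks from the text of a headword field.

    Hyphens are kept, because they belong to the spelling of the word.
    """
    for mark in ('"', "`", "*"):
        hw_text = hw_text.replace(mark, "")
    return hw_text
```

For instance, strip_hw_marks('Ab"a*cus') gives 'Abacus', a string that could
serve as a lookup key when no separate entry field is present.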