author: Sergey Poznyakoff <gray@gnu.org.ua> 2012-02-03 12:48:52 +0200
committer: Sergey Poznyakoff <gray@gnu.org.ua> 2012-02-03 12:48:52 +0200
commit: d18a469b7a5a4d4b5da21eab37f34ab1e99a8dce
tree: 7eb331e376e85287c25b6a9734dae58a4724da8a /webfont.txt
parent: 4a458db06b28492a7e48b1a0560b35778e476482
Revise tagset.txt

* tagset.txt: Review.
* README: Reformat.
* webfont.txt: Reformat. Document <and/ and <or/.
Diffstat (limited to 'webfont.txt')
-rw-r--r-- webfont.txt 302
1 files changed, 153 insertions, 149 deletions
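
For readers who want to experiment with the markup that webfont.txt
documents, here is a minimal sketch (not part of this commit) of how the two
notations it describes, escape sequences of the form \'xx and entities of
the form <WORD/, might be decoded. The names ESCAPES, ENTITIES and decode
are hypothetical, and the small mapping tables are assumptions inferred from
the prose descriptions; the authoritative mapping is the "Uc" column of the
file's symbol summary.

```python
import re

# Illustrative subset only (an assumption, not the table from webfont.txt).
# \'94 is described in the file as "o" with diaeresis.
ESCAPES = {0x94: "\u00f6"}

# Entities named in the "Special symbols" section; the code points are the
# usual Unicode ones for these symbols, assumed rather than taken from the file.
ENTITIES = {
    "dag": "\u2020",    # dagger
    "ddag": "\u2021",   # double dagger
    "para": "\u00b6",   # paragraph mark
    "schwa": "\u0259",  # schwa
    "rarr": "\u2192",   # right-pointing arrow
}

# Either \'xx (two lowercase hex digits) or <WORD/ (note: no closing ">").
# Ordinary tags such as <it> and </it> do not match this pattern.
TOKEN = re.compile(r"\\'([0-9a-f]{2})|<([^<>/\s]+)/")


def decode(text):
    """Replace known escape sequences and entities; leave unknown ones as-is."""
    def repl(match):
        if match.group(1) is not None:
            return ESCAPES.get(int(match.group(1), 16), match.group(0))
        return ENTITIES.get(match.group(2), match.group(0))
    return TOKEN.sub(repl, text)
```

Leaving unknown sequences untouched is a deliberate choice in this sketch:
with only a partial table, it is safer to pass them through than to guess.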
diff --git a/webfont.txt b/webfont.txt
index d432fe5..f7423e1 100644
--- a/webfont.txt
+++ b/webfont.txt
@@ -3,163 +3,169 @@
 
 * Overview
 
-This file describes special symbols and markup entities used in the
-GNU Collaborative International Dictionary of English.
+This file describes special symbols and markup entities used in the GNU
+Collaborative International Dictionary of English.
 
 * Introduction
 
-The special characters used in the electronic version of the Webster
-1913 are required for visualizing unusual characters used in the
-etymology and pronunciation fields of the dictionary, in a form
-comparable to the way they appear in the original.
+The special characters used in the electronic version of the Webster 1913
+are required for visualizing unusual characters used in the etymology and
+pronunciation fields of the dictionary, in a form comparable to the way they
+appear in the original.
 
-The GCIDE markup provides two ways for representing such characters:
-using special "escape sequences" and using special markup entities.
-Historically, "escape sequences" were used to indicate the
-character's ordinal position in a special font, prepared by MICRA,
-Inc. to represent it on screen. Although nowadays this method is
-obsolete, the dictionary corpus still uses these sequences. This file
-describes their mapping to Unicode characters.
+The GCIDE markup provides two ways for representing such characters: using
+special "escape sequences" and using special markup entities. Historically,
+"escape sequences" were used to indicate the character's ordinal position in
+a special font, prepared by MICRA, Inc. to represent it on screen. Although
+nowadays this method is obsolete, the dictionary corpus still uses these
+sequences. This file describes their mapping to Unicode characters.
 
 An escape sequence has the form \'xx, where "x" represent lowercase
-hexadecimal digits. For example, \'94 stands for "o" with diaeresis.
-There are only 256 such sequences.
+hexadecimal digits. For example, \'94 stands for "o" with diaeresis. There
+are only 256 such sequences.
 
-Special markup entities are able to represent a wider range of
-characters. A markup entity is similar to SGML one, but has a
-different format. The traditional &xx; format was judged inconvenient
-because the ampersand is used frequently in the corpus. Instead,
-GCIDE entities have the format <WORD/, where "<" and "/" represent the
-beginning and end of the entity and WORD represents the character
-itself. Valid WORDs are in some cases abbreviations (for compactness)
-of the ISO 8879 recommended symbols. Characters representable by
-escape sequences can also be represented by entities, but the reverse
-is not true, due to a limited range of the former.
-
-The Greek words appearing in the etymologies, when they are included,
-are typed in a roman-letter transcription, which is described below in
-chapter "Greek transliteration".
+Special markup entities are able to represent a wider range of characters.
+A markup entity is similar to SGML one, but has a different format. The
+traditional &xx; format was judged inconvenient because the ampersand is
+used frequently in the corpus. Instead, GCIDE entities have the format
+<WORD/, where "<" and "/" represent the beginning and end of the entity and
+WORD represents the character itself. Valid WORDs are in some cases
+abbreviations (for compactness) of the ISO 8879 recommended symbols.
+Characters representable by escape sequences can also be represented by
+entities, but the reverse is not true, due to a limited range of the former.
+
+The Greek words appearing in the etymologies, when they are included, are
+typed in a roman-letter transcription, which is described below in chapter
+"Greek transliteration".
 
 * Unrecognized characters
 
 Wherever the typists did not know the character to use, they usually
 inserted a reverse-video question mark (decimal 176). This appears in
 full-ASCII versions as <?/. This mark was used both for characters in
-non-ASCII fonts, and for unreadable characters (i.e., characters
-smeared in the original or distorted in the copies available to the
-typists. The type in the original was in many places smeared and
-illegible at the left and right page margins; occasionally, small
-parts of words were blotted out by plain white space).
+non-ASCII fonts, and for unreadable characters (i.e., characters smeared in
+the original or distorted in the copies available to the typists. The type
+in the original was in many places smeared and illegible at the left and
+right page margins; occasionally, small parts of words were blotted out by
+plain white space).
 
 * Italics
 
 In most places, italic font is represented by the tags <it>...</it>
-surrounding the italic text, or by some other tag which also implies
-italic font. In the pronunciations, however, where italicized vowels
-are used among non-italic and other special characters to indicate
-pronunciation, the special codes <ait/, <eit/, <iit/, <oit/, <uit/,
-are also used to indicate the italicized vowel.
+surrounding the italic text, or by some other tag which also implies italic
+font. In the pronunciations, however, where italicized vowels are used
+among non-italic and other special characters to indicate pronunciation, the
+special codes <ait/, <eit/, <iit/, <oit/, <uit/, are also used to indicate
+the italicized vowel.
 
 * Diacritics
 
-Vowels with a circle above (as in Swedish) are coded <xring/ (x with a
-ring, or "degrees" mark over it); vowels with tilde over them are
-represented by <xtil/, where "x" is the vowel, as in <etil/ (<atil/
-also has code 238); letters with a dot above are represented by <xdot/
--- letter with a dot below are represented by <xsdot/ ("subdot");
-vowels with the semi-long mark (a macron with a short perpendicular
-vertical stroke attached above) are represented by <xsl/; the
-circumflex vowels have codes on this list, but may also be represented
-as <xcir/; vowels with macrons above are <xmac/ (including <oomac/,
-the "oo" with an unbroken macron above the two letters, <aemac/ = the
-ligature ae with a macron [also 214 = \'d6], and <oemac/ the ligature
-oe with a macron [also 215 = \'d7]); vowels with umlauts or a crescent
-(breve) above have codes in this list, but may also be represented by
-<xum/ and <xcr/ respectively. There is an occasional hacek or caron
-mark (an inverted circumflex) in the original; such letters are coded
-<xcar/. The o with a caron has code 213, but no other letter with a
-caron is representable by an escape sequence.
-
-The diaeresis is treated typographically as identical to the umlaut.
-A special modification, used only for poetry (see entry "saturnian
-verse" under "saturnian") is a vowel with a macron, in which the
-macron is lighter than the usual macron, signifying a stressed
-syllable which has a short vowel sound. This is represented by
-<xsmac/ ("short mac").
-
-Another special character used in pronunciations is an "n" with an
-underline (like a macron, but below the letter), used to represent the
-"ng" sound. This is coded <nsm/ ("n sub-macron"). The ligated th
-used in pronunciations to depict the "th" sound of "the" is coded as
-<th/.
+Vowels with a circle above (as in Swedish) are coded <xring/ (x with a ring,
+or "degrees" mark over it); vowels with tilde over them are represented by
+<xtil/, where "x" is the vowel, as in <etil/ (<atil/ also has code 238);
+letters with a dot above are represented by <xdot/ -- letter with a dot
+below are represented by <xsdot/ ("subdot"); vowels with the semi-long mark
+(a macron with a short perpendicular vertical stroke attached above) are
+represented by <xsl/; the circumflex vowels have codes on this list, but may
+also be represented as <xcir/; vowels with macrons above are <xmac/
+(including <oomac/, the "oo" with an unbroken macron above the two letters,
+<aemac/ = the ligature ae with a macron [also 214 = \'d6], and <oemac/ the
+ligature oe with a macron [also 215 = \'d7]); vowels with umlauts or a
+crescent (breve) above have codes in this list, but may also be represented
+by <xum/ and <xcr/ respectively. There is an occasional hacek or caron mark
+(an inverted circumflex) in the original; such letters are coded <xcar/.
+The o with a caron has code 213, but no other letter with a caron is
+representable by an escape sequence.
+
+The diaeresis is treated typographically as identical to the umlaut. A
+special modification, used only for poetry (see entry "saturnian verse"
+under "saturnian") is a vowel with a macron, in which the macron is lighter
+than the usual macron, signifying a stressed syllable which has a short
+vowel sound. This is represented by <xsmac/ ("short mac").
+
+Another special character used in pronunciations is an "n" with an underline
+(like a macron, but below the letter), used to represent the "ng" sound.
+This is coded <nsm/ ("n sub-macron"). The ligated th used in pronunciations
+to depict the "th" sound of "the" is coded as <th/.
 
 NOTE: the letter combinations "fi" and "fl" are invariably printed as the
-ligatures &filig; and &fllig;, but these ligatures are not marked as such
-in this transcription, and the two letters are left as individuals.
+ligatures &filig; and &fllig;, but these ligatures are not marked as such in
+this transcription, and the two letters are left as individuals.
 
 * Special symbols
 
-The dagger <dag/, double dagger <ddag/, and paragraph mark <para/ are
-rarely used.
+The dagger <dag/, double dagger <ddag/, and paragraph mark <para/ are rarely
+used.
 
-The double prime, or "seconds" of a degree is sometimes represented by
-a double "light accent" (code 183 = \'b7). In other places, and in
-later versions, it is represented by <sec/ = \'a9.
+The double prime, or "seconds" of a degree is sometimes represented by a
+double "light accent" (code 183 = \'b7). In other places, and in later
+versions, it is represented by <sec/ = \'a9.
 
-The symbols "greater than" <gt/ and "less than" are encountered only
-once, but are distinguished from the right- and left-angle brackets (>
-and <) because of possible typographical differences in some fonts.
+The symbols "greater than" <gt/ and "less than" are encountered only once,
+but are distinguished from the right- and left-angle brackets (> and <)
+because of possible typographical differences in some fonts.
 
-The schwa is symbolized by <schwa/. It is not used in the
-pronunciations, but is mentioned as a symbol. The right-pointing
-arrow is <rarr/, consistent with ISO 8879.
+The schwa is symbolized by <schwa/. It is not used in the pronunciations,
+but is mentioned as a symbol. The right-pointing arrow is <rarr/,
+consistent with ISO 8879.
+
+Two special entities <and/ and <or/ represent words "and" and "or" in
+italics font.
 
 * Symbol summary
 
-Below is a complete list of the symbols used in the Webster, together
-with their "webfont" number (escape sequence), corresponding markup
-entity, and corresponding symbols in ISO 8879 and Tex coding. Much of
-this table was prepared by Rik Faith, to whom we express our
-appreciation.
+Below is a complete list of the symbols used in the Webster, together with
+their "webfont" number (escape sequence), corresponding markup entity, and
+corresponding symbols in ISO 8879 and Tex coding. Much of this table was
+prepared by Rik Faith, to whom we express our appreciation.
 
-The "Uc" column gives the Unicode representation of the character.
-The "nearest ASCII" equivalents are given for those who want to
-display the data as best one can in 7-bit simple ASCII symbols without
-using the "entity" symbols.
+The "Uc" column gives the Unicode representation of the character. The
+"nearest ASCII" equivalents are given for those who want to display the data
+as best one can in 7-bit simple ASCII symbols without using the "entity"
+symbols.
 
 Comments:
- (1) The symbol in the "entity" column is the SGML-like symbol used in
- the present Webster files; the symbol in the "ISO 8879" column is
- the symbol for the same character given in "The user's guide to
- ISO 8879" by Smith and Stutely.
- (2) An asterisk "*" in the "entity" column means that this symbol and
-code value is not used in any form in GCIDE.
- (3) If no asterisk is in the "entity" column, and no other symbol is
+ (1) The symbol in the "entity" column is the SGML-like symbol used in the
+ present Webster files; the symbol in the "ISO 8879" column is the
+ symbol for the same character given in "The user's guide to ISO 8879"
+ by Smith and Stutely.
+
+ (2) An asterisk "*" in the "entity" column means that this symbol and code
+value is not used in any form in GCIDE.
+
+ (3) If no asterisk is in the "entity" column, and no other symbol is
 there, this means that in the Webster, only the hexadecimal representation
 was used (e.g. for \'d8, \'bd, and \'b8).
- (4) \'b6 and \'b7, the heavy and light "accents", are never above a
-letter (these are not diacritical marks), but in-between letters, as the
+
+ (4) \'b6 and \'b7, the heavy and light "accents", are never above a letter
-stress accent used in th