author | Sergey Poznyakoff <gray@gnu.org.ua> | 2012-02-03 12:48:52 +0200 |
---|---|---|
committer | Sergey Poznyakoff <gray@gnu.org.ua> | 2012-02-03 12:48:52 +0200 |
commit | d18a469b7a5a4d4b5da21eab37f34ab1e99a8dce (patch) | |
tree | 7eb331e376e85287c25b6a9734dae58a4724da8a /webfont.txt | |
parent | 4a458db06b28492a7e48b1a0560b35778e476482 (diff) | |
Revise tagset.txt
* tagset.txt: Review.
* README: Reformat.
* webfont.txt: Reformat. Document <and/ and <or/.
Diffstat (limited to 'webfont.txt')
-rw-r--r-- | webfont.txt | 302 |
1 files changed, 153 insertions, 149 deletions
diff --git a/webfont.txt b/webfont.txt index d432fe5..f7423e1 100644 --- a/webfont.txt +++ b/webfont.txt | |||
@@ -3,163 +3,169 @@ | |||
3 | 3 | ||
4 | * Overview | 4 | * Overview |
5 | 5 | ||
6 | This file describes special symbols and markup entities used in the | 6 | This file describes special symbols and markup entities used in the GNU |
7 | GNU Collaborative International Dictionary of English. | 7 | Collaborative International Dictionary of English. |
8 | 8 | ||
9 | * Introduction | 9 | * Introduction |
10 | 10 | ||
11 | The special characters used in the electronic version of the Webster | 11 | The special characters used in the electronic version of the Webster 1913 |
12 | 1913 are required for visualizing unusual characters used in the | 12 | are required for visualizing unusual characters used in the etymology and |
13 | etymology and pronunciation fields of the dictionary, in a form | 13 | pronunciation fields of the dictionary, in a form comparable to the way they |
14 | comparable to the way they appear in the original. | 14 | appear in the original. |
15 | 15 | ||
16 | The GCIDE markup provides two ways for representing such characters: | 16 | The GCIDE markup provides two ways for representing such characters: using |
17 | using special "escape sequences" and using special markup entities. | 17 | special "escape sequences" and using special markup entities. Historically, |
18 | Historically, "escape sequences" were used to indicate the | 18 | "escape sequences" were used to indicate the character's ordinal position in |
19 | character's ordinal position in a special font, prepared by MICRA, | 19 | a special font, prepared by MICRA, Inc. to represent it on screen. Although |
20 | Inc. to represent it on screen. Although nowadays this method is | 20 | nowadays this method is obsolete, the dictionary corpus still uses these |
21 | obsolete, the dictionary corpus still uses these sequences. This file | 21 | sequences. This file describes their mapping to Unicode characters. |
22 | describes their mapping to Unicode characters. | ||
23 | 22 | ||
24 | An escape sequence has the form \'xx, where "x" represent lowercase | 23 | An escape sequence has the form \'xx, where "x" represent lowercase |
25 | hexadecimal digits. For example, \'94 stands for "o" with diaeresis. | 24 | hexadecimal digits. For example, \'94 stands for "o" with diaeresis. There |
26 | There are only 256 such sequences. | 25 | are only 256 such sequences. |
27 | 26 | ||
28 | Special markup entities are able to represent a wider range of | 27 | Special markup entities are able to represent a wider range of characters. |
29 | characters. A markup entity is similar to SGML one, but has a | 28 | A markup entity is similar to SGML one, but has a different format. The |
30 | different format. The traditional &xx; format was judged inconvenient | 29 | traditional &xx; format was judged inconvenient because the ampersand is |
31 | because the ampersand is used frequently in the corpus. Instead, | 30 | used frequently in the corpus. Instead, GCIDE entities have the format |
32 | GCIDE entities have the format <WORD/, where "<" and "/" represent the | 31 | <WORD/, where "<" and "/" represent the beginning and end of the entity and |
33 | beginning and end of the entity and WORD represents the character | 32 | WORD represents the character itself. Valid WORDs are in some cases |
34 | itself. Valid WORDs are in some cases abbreviations (for compactness) | 33 | abbreviations (for compactness) of the ISO 8879 recommended symbols. |
35 | of the ISO 8879 recommended symbols. Characters representable by | 34 | Characters representable by escape sequences can also be represented by |
36 | escape sequences can also be represented by entities, but the reverse | 35 | entities, but the reverse is not true, due to a limited range of the former. |
37 | is not true, due to a limited range of the former. | 36 | |
38 | 37 | The Greek words appearing in the etymologies, when they are included, are | |
39 | The Greek words appearing in the etymologies, when they are included, | 38 | typed in a roman-letter transcription, which is described below in chapter |
40 | are typed in a roman-letter transcription, which is described below in | 39 | "Greek transliteration". |
41 | chapter "Greek transliteration". | ||
42 | 40 | ||
43 | * Unrecognized characters | 41 | * Unrecognized characters |
44 | 42 | ||
45 | Wherever the typists did not know the character to use, they usually | 43 | Wherever the typists did not know the character to use, they usually |
46 | inserted a reverse-video question mark (decimal 176). This appears in | 44 | inserted a reverse-video question mark (decimal 176). This appears in |
47 | full-ASCII versions as <?/. This mark was used both for characters in | 45 | full-ASCII versions as <?/. This mark was used both for characters in |
48 | non-ASCII fonts, and for unreadable characters (i.e., characters | 46 | non-ASCII fonts, and for unreadable characters (i.e., characters smeared in |
49 | smeared in the original or distorted in the copies available to the | 47 | the original or distorted in the copies available to the typists. The type |
50 | typists. The type in the original was in many places smeared and | 48 | in the original was in many places smeared and illegible at the left and |
51 | illegible at the left and right page margins; occasionally, small | 49 | right page margins; occasionally, small parts of words were blotted out by |
52 | parts of words were blotted out by plain white space). | 50 | plain white space). |
53 | 51 | ||
54 | * Italics | 52 | * Italics |
55 | 53 | ||
56 | In most places, italic font is represented by the tags <it>...</it> | 54 | In most places, italic font is represented by the tags <it>...</it> |
57 | surrounding the italic text, or by some other tag which also implies | 55 | surrounding the italic text, or by some other tag which also implies italic |
58 | italic font. In the pronunciations, however, where italicized vowels | 56 | font. In the pronunciations, however, where italicized vowels are used |
59 | are used among non-italic and other special characters to indicate | 57 | among non-italic and other special characters to indicate pronunciation, the |
60 | pronunciation, the special codes <ait/, <eit/, <iit/, <oit/, <uit/, | 58 | special codes <ait/, <eit/, <iit/, <oit/, <uit/, are also used to indicate |
61 | are also used to indicate the italicized vowel. | 59 | the italicized vowel. |
62 | 60 | ||
63 | * Diacritics | 61 | * Diacritics |
64 | 62 | ||
65 | Vowels with a circle above (as in Swedish) are coded <xring/ (x with a | 63 | Vowels with a circle above (as in Swedish) are coded <xring/ (x with a ring, |
66 | ring, or "degrees" mark over it); vowels with tilde over them are | 64 | or "degrees" mark over it); vowels with tilde over them are represented by |
67 | represented by <xtil/, where "x" is the vowel, as in <etil/ (<atil/ | 65 | <xtil/, where "x" is the vowel, as in <etil/ (<atil/ also has code 238); |
68 | also has code 238); letters with a dot above are represented by <xdot/ | 66 | letters with a dot above are represented by <xdot/ -- letter with a dot |
69 | -- letter with a dot below are represented by <xsdot/ ("subdot"); | 67 | below are represented by <xsdot/ ("subdot"); vowels with the semi-long mark |
70 | vowels with the semi-long mark (a macron with a short perpendicular | 68 | (a macron with a short perpendicular vertical stroke attached above) are |
71 | vertical stroke attached above) are represented by <xsl/; the | 69 | represented by <xsl/; the circumflex vowels have codes on this list, but may |
72 | circumflex vowels have codes on this list, but may also be represented | 70 | also be represented as <xcir/; vowels with macrons above are <xmac/ |
73 | as <xcir/; vowels with macrons above are <xmac/ (including <oomac/, | 71 | (including <oomac/, the "oo" with an unbroken macron above the two letters, |
74 | the "oo" with an unbroken macron above the two letters, <aemac/ = the | 72 | <aemac/ = the ligature ae with a macron [also 214 = \'d6], and <oemac/ the |
75 | ligature ae with a macron [also 214 = \'d6], and <oemac/ the ligature | 73 | ligature oe with a macron [also 215 = \'d7]); vowels with umlauts or a |
76 | oe with a macron [also 215 = \'d7]); vowels with umlauts or a crescent | 74 | crescent (breve) above have codes in this list, but may also be represented |
77 | (breve) above have codes in this list, but may also be represented by | 75 | by <xum/ and <xcr/ respectively. There is an occasional hacek or caron mark |
78 | <xum/ and <xcr/ respectively. There is an occasional hacek or caron | 76 | (an inverted circumflex) in the original; such letters are coded <xcar/. |
79 | mark (an inverted circumflex) in the original; such letters are coded | 77 | The o with a caron has code 213, but no other letter with a caron is |
80 | <xcar/. The o with a caron has code 213, but no other letter with a | 78 | representable by an escape sequence. |
81 | caron is representable by an escape sequence. | 79 | |
82 | 80 | The diaeresis is treated typographically as identical to the umlaut. A | |
83 | The diaeresis is treated typographically as identical to the umlaut. | 81 | special modification, used only for poetry (see entry "saturnian verse" |
84 | A special modification, used only for poetry (see entry "saturnian | 82 | under "saturnian") is a vowel with a macron, in which the macron is lighter |
85 | verse" under "saturnian") is a vowel with a macron, in which the | 83 | than the usual macron, signifying a stressed syllable which has a short |
86 | macron is lighter than the usual macron, signifying a stressed | 84 | vowel sound. This is represented by <xsmac/ ("short mac"). |
87 | syllable which has a short vowel sound. This is represented by | 85 | |
88 | <xsmac/ ("short mac"). | 86 | Another special character used in pronunciations is an "n" with an underline |
89 | 87 | (like a macron, but below the letter), used to represent the "ng" sound. | |
90 | Another special character used in pronunciations is an "n" with an | 88 | This is coded <nsm/ ("n sub-macron"). The ligated th used in pronunciations |
91 | underline (like a macron, but below the letter), used to represent the | 89 | to depict the "th" sound of "the" is coded as <th/. |
92 | "ng" sound. This is coded <nsm/ ("n sub-macron"). The ligated th | ||
93 | used in pronunciations to depict the "th" sound of "the" is coded as | ||
94 | <th/. | ||
95 | 90 | ||
96 | NOTE: the letter combinations "fi" and "fl" are invariably printed as the | 91 | NOTE: the letter combinations "fi" and "fl" are invariably printed as the |
97 | ligatures fi and fl, but these ligatures are not marked as such | 92 | ligatures fi and fl, but these ligatures are not marked as such in |
98 | in this transcription, and the two letters are left as individuals. | 93 | this transcription, and the two letters are left as individuals. |
99 | 94 | ||
100 | * Special symbols | 95 | * Special symbols |
101 | 96 | ||
102 | The dagger <dag/, double dagger <ddag/, and paragraph mark <para/ are | 97 | The dagger <dag/, double dagger <ddag/, and paragraph mark <para/ are rarely |
103 | rarely used. | 98 | used. |
104 | 99 | ||
105 | The double prime, or "seconds" of a degree is sometimes represented by | 100 | The double prime, or "seconds" of a degree is sometimes represented by a |
106 | a double "light accent" (code 183 = \'b7). In other places, and in | 101 | double "light accent" (code 183 = \'b7). In other places, and in later |
107 | later versions, it is represented by <sec/ = \'a9. | 102 | versions, it is represented by <sec/ = \'a9. |
108 | 103 | ||
109 | The symbols "greater than" <gt/ and "less than" are encountered only | 104 | The symbols "greater than" <gt/ and "less than" are encountered only once, |
110 | once, but are distinguished from the right- and left-angle brackets (> | 105 | but are distinguished from the right- and left-angle brackets (> and <) |
111 | and <) because of possible typographical differences in some fonts. | 106 | because of possible typographical differences in some fonts. |
112 | 107 | ||
113 | The schwa is symbolized by <schwa/. It is not used in the | 108 | The schwa is symbolized by <schwa/. It is not used in the pronunciations, |
114 | pronunciations, but is mentioned as a symbol. The right-pointing | 109 | but is mentioned as a symbol. The right-pointing arrow is <rarr/, |
115 | arrow is <rarr/, consistent with ISO 8879. | 110 | consistent with ISO 8879. |
111 | |||
112 | Two special entities <and/ and <or/ represent words "and" and "or" in | ||
113 | italics font. | ||
116 | 114 | ||
117 | * Symbol summary | 115 | * Symbol summary |
118 | 116 | ||
119 | Below is a complete list of the symbols used in the Webster, together | 117 | Below is a complete list of the symbols used in the Webster, together with |
120 | with their "webfont" number (escape sequence), corresponding markup | 118 | their "webfont" number (escape sequence), corresponding markup entity, and |
121 | entity, and corresponding symbols in ISO 8879 and Tex coding. Much of | 119 | corresponding symbols in ISO 8879 and Tex coding. Much of this table was |
122 | this table was prepared by Rik Faith, to whom we express our | 120 | prepared by Rik Faith, to whom we express our appreciation. |
123 | appreciation. | ||
124 | 121 | ||
125 | The "Uc" column gives the Unicode representation of the character. | 122 | The "Uc" column gives the Unicode representation of the character. The |
126 | The "nearest ASCII" equivalents are given for those who want to | 123 | "nearest ASCII" equivalents are given for those who want to display the data |
127 | display the data as best one can in 7-bit simple ASCII symbols without | 124 | as best one can in 7-bit simple ASCII symbols without using the "entity" |
128 | using the "entity" symbols. | 125 | symbols. |
129 | 126 | ||
130 | Comments: | 127 | Comments: |
131 | (1) The symbol in the "entity" column is the SGML-like symbol used in | 128 | (1) The symbol in the "entity" column is the SGML-like symbol used in the |
132 | the present Webster files; the symbol in the "ISO 8879" column is | 129 | present Webster files; the symbol in the "ISO 8879" column is the |
133 | the symbol for the same character given in "The user's guide to | 130 | symbol for the same character given in "The user's guide to ISO 8879" |
134 | ISO 8879" by Smith and Stutely. | 131 | by Smith and Stutely. |
135 | (2) An asterisk "*" in the "entity" column means that this symbol and | 132 | |
136 | code value is not used in any form in GCIDE. | 133 | (2) An asterisk "*" in the "entity" column means that this symbol and code |
137 | (3) If no asterisk is in the "entity" column, and no other symbol is | 134 | value is not used in any form in GCIDE. |
135 | |||
136 | (3) If no asterisk is in the "entity" column, and no other symbol is | ||
138 | there, this means that in the Webster, only the hexadecimal representation | 137 | there, this means that in the Webster, only the hexadecimal representation |
139 | was used (e.g. for \'d8, \'bd, and \'b8). | 138 | was used (e.g. for \'d8, \'bd, and \'b8). |
140 | (4) \'b6 and \'b7, the heavy and light "accents", are never above a | 139 | |
141 | letter (these are not diacritical marks), but in-between letters, as the | 140 | (4) \'b6 and \'b7, the heavy and light "accents", are never above a letter |
142 | stress accent used in th |