author | Sergey Poznyakoff <gray@gnu.org.ua> | 2012-02-03 00:08:07 +0200 |
---|---|---|
committer | Sergey Poznyakoff <gray@gnu.org.ua> | 2012-02-03 00:08:07 +0200 |
commit | 4a458db06b28492a7e48b1a0560b35778e476482 (patch) | |
tree | ef19ae1addbb291801482465d9b6a923ba2417ed /webfont.txt | |
parent | 60c1ea4788f2702eeeba8453f158861091ed28b1 (diff) | |
Further work on ancillary files.
* webfont.txt: Use Unicode, rewrite character table and Greek
transliteration sections.
* pronunc.txt: Update.
* tagset.txt: Update.
Diffstat (limited to 'webfont.txt')
-rw-r--r-- | webfont.txt | 1046 |
1 files changed, 529 insertions, 517 deletions
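
The rewritten webfont.txt describes two notations: escape sequences of the form \'xx (two lowercase hexadecimal digits) and markup entities of the form <WORD/. The following is a minimal Python sketch of how a reader of the corpus might map both to Unicode. It is not part of this commit. The function name `to_unicode`, the two tables, and the specific code points are assumptions made for illustration; only a handful of sequences mentioned in the text are covered, and the authoritative mapping is the "Uc" column of webfont.txt itself.

```python
import re

# Partial, illustrative tables. The code points are assumptions chosen to
# match the descriptions in webfont.txt, not values copied from its table.
ESCAPES = {
    "94": "\u00f6",  # o with diaeresis
    "d5": "\u01d2",  # o with caron (code 213)
    "d6": "\u01e3",  # ae ligature with macron (code 214)
    "ee": "\u00e3",  # a with tilde (code 238)
    "a9": "\u2033",  # double prime ("seconds")
}
ENTITIES = {
    "aemac": "\u01e3",  # ae ligature with macron
    "atil": "\u00e3",   # a with tilde
    "ocar": "\u01d2",   # o with caron
    "sec": "\u2033",    # double prime
    "dag": "\u2020",    # dagger
    "ddag": "\u2021",   # double dagger
    "para": "\u00b6",   # paragraph mark
    "schwa": "\u0259",  # schwa
    "rarr": "\u2192",   # right-pointing arrow
}

# \'xx : backslash, apostrophe, two lowercase hex digits.
ESCAPE_RE = re.compile(r"\\'([0-9a-f]{2})")
# <WORD/ : "<", a name, "/". Ordinary tags such as <it> and </it> do not match.
ENTITY_RE = re.compile(r"<([^<>/\s]+)/")


def to_unicode(text):
    """Replace known escapes and entities; leave unknown ones untouched."""
    text = ESCAPE_RE.sub(lambda m: ESCAPES.get(m.group(1), m.group(0)), text)
    return ENTITY_RE.sub(lambda m: ENTITIES.get(m.group(1), m.group(0)), text)
```

Unknown sequences are passed through unchanged so that gaps in the partial tables remain visible in the output instead of being silently dropped.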
diff --git a/webfont.txt b/webfont.txt index 591e980..d432fe5 100644 --- a/webfont.txt +++ b/webfont.txt | |||
@@ -1,88 +1,70 @@ | |||
1 | WEBSTER FONTS | 1 | WEBSTER FONTS |
2 | ============= | 2 | ============= |
3 | 3 | ||
4 | Fonts for the Webster 1913 Dictionary. | 4 | * Overview |
5 | For version 0.50 | 5 | |
6 | Last edit May 5, 2001 | 6 | This file describes special symbols and markup entities used in the |
7 | ______________________________________ | 7 | GNU Collaborative International Dictionary of English. |
8 | (This file contains some extended ASCII characters, and should be | 8 | |
9 | transmitted in binary mode) | 9 | * Introduction |
10 | ---------------------------------------------------------------------- | 10 | |
11 | 11 | The special characters used in the electronic version of the Webster | |
12 | This file describes a modified font for use in visualizing the | ||
13 | text of the 1913 "Webster's Revised Unabridged Dictionary" (W1913), | ||
14 | usable for the DOS operating system of IBM-compatible personal computers. | ||
15 | The electronic version of that dictionary and this font were prepared by | ||
16 | MICRA, Inc., Plainfield NJ, and are copyrighted (C) 1996 by MICRA, Inc. | ||
17 | For details of permissions and restrictions on using these files, see | ||
18 | the accompanying file "readme.web". | ||
19 | The special characters used in the electronic version of the Webster | ||
20 | 1913 are required for visualizing unusual characters used in the | 12 | 1913 are required for visualizing unusual characters used in the |
21 | etymology and pronunciation fields of the dictionary, in a form | 13 | etymology and pronunciation fields of the dictionary, in a form |
22 | comparable to the way they appear in the original. Since there are | 14 | comparable to the way they appear in the original. |
23 | more than 256 characters used in that dictionary, not all can be | 15 | |
24 | represented by single-byte codes, and are instead represented by | 16 | The GCIDE markup provides two ways for representing such characters: |
25 | SGML-style "short-form" symbols. (rather than the "entity" format | 17 | using special "escape sequences" and using special markup entities. |
26 | "&xx;" The ampersand is used frequently, and we prefer to leave | 18 | Historically, "escape sequences" were used to indicate the |
27 | the "<" as the only "escape" character) of the type <x/ where x | 19 | character's ordinal position in a special font, prepared by MICRA, |
28 | is a specific code for the symbol in the dictionary. | 20 | Inc. to represent it on screen. Although nowadays this method is |
29 | See the "Short Form" section below for details about such characters. | 21 | obsolete, the dictionary corpus still uses these sequences. This file |
30 | Note that the symbols used here are in some cases abbreviations | 22 | describes their mapping to Unicode characters. |
31 | (for compactness) of the ISO 8879 recommended symbols. If necessary, | 23 | |
32 | the table below allows simple replacement by alternate encodings. | 24 | An escape sequence has the form \'xx, where "x" represent lowercase |
33 | This symbol font can be loaded in IBM-compatible (x86) computers | 25 | hexadecimal digits. For example, \'94 stands for "o" with diaeresis. |
34 | running the DOS operating system by using the "font.bat" command file | 26 | There are only 256 such sequences. |
35 | in the "utils" directory. The fonts files for 8x14 and 8x16 fonts are | 27 | |
36 | "web14.fnt" and "web16.fnt" respectively. | 28 | Special markup entities are able to represent a wider range of |
37 | For those loading the Webster onto some machine other than an | 29 | characters. A markup entity is similar to SGML one, but has a |
38 | IBM-compatible running DOS, it will be necessary to provide a | 30 | different format. The traditional &xx; format was judged inconvenient |
39 | translation table, to convert these characters into a code that | 31 | because the ampersand is used frequently in the corpus. Instead, |
40 | can be handled by that computer. For this reason, I attach an | 32 | GCIDE entities have the format <WORD/, where "<" and "/" represent the |
41 | "explanation" for each character, for those who cannot view | 33 | beginning and end of the entity and WORD represents the character |
42 | the original DOS font. | 34 | itself. Valid WORDs are in some cases abbreviations (for compactness) |
43 | The DOS-loadable font does not contain all of the characters needed | 35 | of the ISO 8879 recommended symbols. Characters representable by |
44 | to depict the etymologies or the pronunciations. In addition to an | 36 | escape sequences can also be represented by entities, but the reverse |
45 | absence of several characters used in the pronunciations, no Greek letters are | 37 | is not true, due to a limited range of the former. |
46 | included. The Greek words appearing in the etymologies, | 38 | |
47 | when they are included, will be typed in a | 39 | The Greek words appearing in the etymologies, when they are included, |
48 | roman-letter transcription (See section on Greek transcription, below). | 40 | are typed in a roman-letter transcription, which is described below in |
49 | Only a very few Greek words have been thus transcribed as of the | 41 | chapter "Greek transliteration". |
50 | present version (version 0.41). | 42 | |
51 | Wherever the typists did not know the character to use, they | 43 | * Unrecognized characters |
52 | usually inserted a reverse-video question mark (decimal 176). | 44 | |
53 | This appears in full-ASCII versions as <?/. This mark was used both for | 45 | Wherever the typists did not know the character to use, they usually |
54 | characters in non-ASCII fonts, and for unreadable characters (i.e., | 46 | inserted a reverse-video question mark (decimal 176). This appears in |
55 | characters smeared in the original or distorted in the copies available | 47 | full-ASCII versions as <?/. This mark was used both for characters in |
56 | to the typists. The type in the original was in many places smeared and | 48 | non-ASCII fonts, and for unreadable characters (i.e., characters |
49 | smeared in the original or distorted in the copies available to the | ||
50 | typists. The type in the original was in many places smeared and | ||
57 | illegible at the left and right page margins; occasionally, small | 51 | illegible at the left and right page margins; occasionally, small |
58 | parts of words were blotted out by plain white space). | 52 | parts of words were blotted out by plain white space). |
59 | A character table for the high-order characters appears below. | 53 | |
60 | Under that is a list and description of most of the special characters | 54 | * Italics |
61 | used in the Webster files. | 55 | |
62 | Note that there are yet some characters used in the etymologies, | 56 | In most places, italic font is represented by the tags <it>...</it> |
63 | and some other symbols, which are not in this list. For example, the | ||
64 | vowels with a double dot *underneath*, e.g. a (as in all) have no representation | ||
65 | in this character set, and, where explicitly entered in the dictionary, | ||
66 | are represented by <xdd/ where "x" is the letter, as in "<add/". | ||
67 | |||
68 | ITALICS | ||
69 | ------- | ||
70 | In most places, italic font is represented by the tags <it>...</it> | ||
71 | surrounding the italic text, or by some other tag which also implies | 57 | surrounding the italic text, or by some other tag which also implies |
72 | italic font. In the pronunciations, however, where italicized vowels | 58 | italic font. In the pronunciations, however, where italicized vowels |
73 | are used among non-italic and other special characters to indicate | 59 | are used among non-italic and other special characters to indicate |
74 | pronunciation, the special codes <ait/, <eit/, <iit/, <oit/, <uit/, | 60 | pronunciation, the special codes <ait/, <eit/, <iit/, <oit/, <uit/, |
75 | are also used to indicate the italicized vowel. | 61 | are also used to indicate the italicized vowel. |
76 | 62 | ||
77 | DIACRITICS | 63 | * Diacritics |
78 | ------------- | 64 | |
79 | The European grave and acute accents are represented by the | 65 | Vowels with a circle above (as in Swedish) are coded <xring/ (x with a |
80 | standard (IBM PC) high-order codes. Other characters with diacritics | 66 | ring, or "degrees" mark over it); vowels with tilde over them are |
81 | are represented by special "entity" codes, and in some cases also | 67 | represented by <xtil/, where "x" is the vowel, as in <etil/ (<atil/ |
82 | are found in this special WEB1913 font, described below. | ||
83 | Vowels with a circle above (as in Swedish) are coded <xring/ | ||
84 | (x with a ring, or "degrees" mark over it); vowels with tilde over them | ||
85 | are represented by <xtil/, where "x" is the vowel, as in <etil/ (<atil/ | ||
86 | also has code 238); letters with a dot above are represented by <xdot/ | 68 | also has code 238); letters with a dot above are represented by <xdot/ |
87 | -- letter with a dot below are represented by <xsdot/ ("subdot"); | 69 | -- letter with a dot below are represented by <xsdot/ ("subdot"); |
88 | vowels with the semi-long mark (a macron with a short perpendicular | 70 | vowels with the semi-long mark (a macron with a short perpendicular |
@@ -93,70 +75,57 @@ the "oo" with an unbroken macron above the two letters, <aemac/ = the | |||
93 | ligature ae with a macron [also 214 = \'d6], and <oemac/ the ligature | 75 | ligature ae with a macron [also 214 = \'d6], and <oemac/ the ligature |
94 | oe with a macron [also 215 = \'d7]); vowels with umlauts or a crescent | 76 | oe with a macron [also 215 = \'d7]); vowels with umlauts or a crescent |
95 | (breve) above have codes in this list, but may also be represented by | 77 | (breve) above have codes in this list, but may also be represented by |
96 | <xum/ and <xcr/ respectively. There is an occasional hacek or caron mark | 78 | <xum/ and <xcr/ respectively. There is an occasional hacek or caron |
97 | (an inverted circumflex) in the original; such letters are coded <xcar/. | 79 | mark (an inverted circumflex) in the original; such letters are coded |
98 | The o with a caron has code 213, but no others are in this font list. | 80 | <xcar/. The o with a caron has code 213, but no other letter with a |
81 | caron is representable by an escape sequence. | ||
82 | |||
99 | The diaeresis is treated typographically as identical to the umlaut. | 83 | The diaeresis is treated typographically as identical to the umlaut. |
100 | A special modification, used only for poetry (see entry "saturnian verse" | 84 | A special modification, used only for poetry (see entry "saturnian |
101 | under "saturnian") is a vowel with a macron, in which the macron is lighter | 85 | verse" under "saturnian") is a vowel with a macron, in which the |
102 | than the usual macron, signifying a stressed syllable which has a short | 86 | macron is lighter than the usual macron, signifying a stressed |
103 | vowel sound. This is represented by <xsmac/ ("short mac"). | 87 | syllable which has a short vowel sound. This is represented by |
104 | Another special character used in pronunciations is an "n" with an underline (like | 88 | <xsmac/ ("short mac"). |
105 | a macron, but below the letter), used to represent the "ng" sound. This is coded | 89 | |
106 | <nsm/ ("n sub-macron"). The ligated th used in pronunciations to depict the | 90 | Another special character used in pronunciations is an "n" with an |
107 | "th" sound of "the" is coded as <th/. | 91 | underline (like a macron, but below the letter), used to represent the |
108 | NOTE: the letter combinations "fi" and "fl" are invariably printed as the | 92 | "ng" sound. This is coded <nsm/ ("n sub-macron"). The ligated th |
93 | used in pronunciations to depict the "th" sound of "the" is coded as | ||
94 | <th/. | ||
95 | |||
96 | NOTE: the letter combinations "fi" and "fl" are invariably printed as the | ||
109 | ligatures fi and fl, but these ligatures are not marked as such | 97 | ligatures fi and fl, but these ligatures are not marked as such |
110 | in this transcription, and the two letters are left as individuals. | 98 | in this transcription, and the two letters are left as individuals. |
111 | 99 | ||
112 | SPECIAL SYMBOLS | 100 | * Special symbols |
113 | The dagger <dag/, double dagger <ddag/, and paragraph mark <para/ are rarely used. | 101 | |
114 | The double prime, or "seconds" of a degree is sometimes represented by | 102 | The dagger <dag/, double dagger <ddag/, and paragraph mark <para/ are |
115 | a double "light accent" (code 183 = \'b7). In other places, and in later | 103 | rarely used. |
116 | versions, it is represented by <sec/ = hex a9, in the webfont. | 104 | |
117 | The symbols "greater than" <gt/ and "less than" are encountered only | 105 | The double prime, or "seconds" of a degree is sometimes represented by |
118 | once, but are distinguished from the right- and left-angle brackets | 106 | a double "light accent" (code 183 = \'b7). In other places, and in |
119 | (> and <) because of possible typographical differences in some fonts. | 107 | later versions, it is represented by <sec/ = \'a9. |
120 | The schwa is symbolized by <schwa/. It is not used in the | 108 | |
121 | pronunciations, but is mentioned as a symbol. | 109 | The symbols "greater than" <gt/ and "less than" are encountered only |
122 | The right-pointing arrow is <rarr/, consistent with ISO 8879. | 110 | once, but are distinguished from the right- and left-angle brackets (> |
123 | 111 | and <) because of possible typographical differences in some fonts. | |
124 | ---------------------------------- | 112 | |
125 | Table 1 | 113 | The schwa is symbolized by <schwa/. It is not used in the |
126 | ---------------------------------- | 114 | pronunciations, but is mentioned as a symbol. The right-pointing |
127 | Numbers | 115 | arrow is <rarr/, consistent with ISO 8879. |
128 | Hex codes | 116 | |
129 | 1 | 117 | * Symbol summary |
130 | 11 (12 is a hard page break, 13 CR, 14 sect break) | 118 | |
131 | 21 | 119 | Below is a complete list of the symbols used in the Webster, together |
132 | 31 !"# $%&'( | 120 | with their "webfont" number (escape sequence), corresponding markup |
133 | 121 yz{|} ~ 79-7d 7e-82 | 121 | entity, and corresponding symbols in ISO 8879 and Tex coding. Much of |
134 | 131 83-87 88-8c | 122 | this table was prepared by Rik Faith, to whom we express our |
135 | 141 8d-91 92-96 | 123 | appreciation. |
136 | 151 97-9b 9c-a0 | 124 | |
137 | 161 a1-a5 a6-aa | 125 | The "Uc" column gives the Unicode representation of the character. |
138 | 171 ab-af b0-b4 | 126 | The "nearest ASCII" equivalents are given for those who want to |
139 | 181 b5-b9 ba-be | 127 | display the data as best one can in 7-bit simple ASCII symbols without |
140 | 191 bf-c3 c4-c8 | ||
141 | 201 c9-cd ce-d2 | ||
142 | 211 d3-d7 d8-dc | ||
143 | 221 dd-e1 e2-e6 | ||
144 | 231 e7-eb ec-f0 | ||
145 | 241 f1-f5 f6-fa | ||
146 | 251 fb-ff | ||
147 | |||
148 | =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- | ||
149 | Below is a complete list of the symbols used in the Webster ("webfont") | ||
150 | which are encoded in the special font listed above, together with | ||
151 | corresponding symbols in ISO 8879 and Tex coding. Much of this table was |