
charsets man page (Chinese edition)

Author: Anonymous 2011-08-24 10:18:28


NAME

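charsets - programmer's view of character sets and internationalization

DESCRIPTION

Linux is an international operating system. Various of its utilities and device drivers (including the console driver) support multilingual character sets, including Latin-alphabet letters with diacritical marks, accents, ligatures, and entire non-Latin alphabets including Greek, Cyrillic, Arabic, and Hebrew.

This manual page presents a programmer's-eye view of different character-set standards and how they fit together on Linux. Standards discussed include ASCII, ISO 8859, KOI8-R, Unicode, ISO 2022 and ISO 4873.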

ASCII

ASCII (American Standard Code For Information Interchange) is the original 7-bit character set, originally designed for American English. It is currently described by the ECMA-6 standard.

An ASCII variant that replaces the American crosshatch/octothorpe/hash pound symbol with the British pound-sterling symbol is used in Great Britain; when needed, the American and British variants may be distinguished as "US ASCII" and "UK ASCII".

As Linux was written for hardware designed in the US, it natively supports US ASCII.

ISO 8859

ISO 8859 is a series of 10 8-bit character sets all of which have US ASCII in their low (7-bit) half, invisible control characters in positions 128 to 159, and 96 fixed-width graphics in positions 160-255.

Of these, the most important is ISO 8859-1 (Latin-1). It is natively supported in the Linux console driver, well supported in X11R6, and is the base character set of HTML.

Console support for the other 8859 character sets is available under Linux through user-mode utilities (such as setfont(8)) that modify keyboard bindings and the EGA graphics table and employ the "user mapping" font table in the console driver.

Here are brief descriptions of each set:

8859-1 (Latin-1): Latin-1 covers most Western European languages such as Albanian, Catalan, Danish, Dutch, English, Faroese, Finnish, French, German, Galician, Irish, Icelandic, Italian, Norwegian, Portuguese, Spanish, and Swedish. The lack of the ligatures Dutch ij, French oe and old-style ,,German`` quotation marks is tolerable.
8859-2 (Latin-2): Latin-2 supports most Latin-written Slavic and Central European languages: Croatian, Czech, German, Hungarian, Polish, Rumanian, Slovak, and Slovene.
8859-3 (Latin-3): Latin-3 is popular with authors of Esperanto, Galician, Maltese, and Turkish.
8859-4 (Latin-4): Latin-4 introduced letters for Estonian, Latvian, and Lithuanian. It is essentially obsolete; see 8859-10 (Latin-6).
8859-5: Cyrillic letters supporting Bulgarian, Byelorussian, Macedonian, Russian, Serbian and Ukrainian. Ukrainians read the letter `ghe' with downstroke as `heh' and would need a ghe with upstroke to write a correct ghe. See the discussion of KOI8-R below.
8859-6: Supports Arabic. The 8859-6 glyph table is a fixed font of separate letter forms, but a proper display engine should combine these using the proper initial, medial, and final forms.
8859-7: Supports Modern Greek.
8859-8: Supports Hebrew.
8859-9 (Latin-5): This is a variant of Latin-1 that replaces rarely used Icelandic letters with Turkish ones.
8859-10 (Latin-6): Latin 6 adds the last Inuit (Greenlandic) and Sami (Lappish) letters that were missing in Latin 4 to cover the entire Nordic area. RFC 1345 listed a preliminary and different `latin6'. Skolt Sami still needs a few more accents than these.
8859-13 (Latin-7)
8859-14 (Latin-8)
8859-15: Adds the Euro sign and French ligatures, which were missing in Latin-1.

KOI8-R

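KOI8-R is a non-ISO character set popular in Russia. The lower half is US ASCII; the upper is a Cyrillic character set somewhat better designed than ISO 8859-5.

Console support for KOI8-R is available under Linux through user-mode utilities that modify keyboard bindings and the EGA graphics table, and employ the "user mapping" font table in the console driver.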

UNICODE

Unicode (ISO 10646) is a standard which aims to unambiguously represent every known glyph in every human language. Unicode's native encoding is 32-bit (older versions used 16 bits). Information on Unicode is available at <http://www.unicode.com>.

Linux represents Unicode using the 8-bit Unicode Transfer Format (UTF-8). UTF-8 is a variable length encoding of Unicode. It uses 1 byte to code 7 bits, 2 bytes for 11 bits, 3 bytes for 16 bits, 4 bytes for 21 bits, 5 bytes for 26 bits, 6 bytes for 31 bits.

Let 0,1,x stand for a zero, one, or arbitrary bit. A byte 0xxxxxxx stands for the Unicode 00000000 0xxxxxxx which codes the same symbol as the ASCII 0xxxxxxx. Thus, ASCII goes unchanged into UTF-8, and people using only ASCII do not notice any change: not in code, and not in file size.

A byte 110xxxxx is the start of a 2-byte code, and 110xxxxx 10yyyyyy is assembled into 00000xxx xxyyyyyy. A byte 1110xxxx is the start of a 3-byte code, and 1110xxxx 10yyyyyy 10zzzzzz is assembled into xxxxyyyy yyzzzzzz. (When UTF-8 is used to code the 31-bit ISO 10646 then this progression continues up to 6-byte codes.)

For ISO-8859-1 users this means that the characters with the high bit set are now coded with two bytes. This tends to expand ordinary text files by one or two percent. There are no conversion problems, however, since the Unicode value of ISO-8859-1 symbols equals their ISO-8859-1 value (prefixed by eight leading zero bits). For Japanese users this means that the 16-bit codes now in common use will take three bytes, and extensive mapping tables are required. Many Japanese therefore prefer ISO 2022.

Note that UTF-8 is self-synchronizing: 10xxxxxx is a tail, any other byte is the head of a code. Note that the only way ASCII bytes occur in a UTF-8 stream is as themselves. In particular, there are no embedded NULs or '/'s that form part of some larger code.

Since ASCII, and, in particular, NUL and '/', are unchanged, the kernel does not notice that UTF-8 is being used. It does not care at all what the bytes it is handling stand for.

Rendering of Unicode data streams is typically handled through `subfont' tables which map a subset of Unicode to glyphs. Internally the kernel uses Unicode to describe the subfont loaded in video RAM. This means that in UTF-8 mode one can use a character set with 512 different symbols. This is not enough for Japanese, Chinese and Korean, but it is enough for most other purposes.

ISO 2022 AND ISO 4873

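The ISO 2022 and 4873 standards describe a font-control model based on VT100 practice. This model is (partially) supported by the Linux kernel and by xterm(1). It is popular in Japan and Korea.

There are 4 graphic character sets, called G0, G1, G2 and G3, and one of them is the current character set for codes with high bit zero (initially G0), and one of them is the current character set for codes with high bit one (initially G1). Each graphic character set has 94 or 96 characters, and is essentially a 7-bit character set. It uses codes either 040-0177 (041-0176) or 0240-0377 (0241-0376). G0 always has size 94 and uses codes 041-0176.

Switching between character sets is done using the shift functions ^N (SO or LS1), ^O (SI or LS0), ESC n (LS2), ESC o (LS3), ESC N (SS2), ESC O (SS3), ESC ~ (LS1R), ESC } (LS2R), ESC | (LS3R). The function LSn makes character set Gn the current one for codes with high bit zero. The function LSnR makes character set Gn the current one for codes with high bit one. The function SSn makes character set Gn (n=2 or 3) the current one for the next character only (regardless of the value of its high order bit).

A 94-character set is designated as Gn character set by an escape sequence ESC ( xx (for G0), ESC ) xx (for G1), ESC * xx (for G2), ESC + xx (for G3), where xx is a symbol or a pair of symbols found in the ISO 2375 International Register of Coded Character Sets. For example, ESC ( @ selects the ISO 646 character set as G0, ESC ( A selects the UK standard character set (with pound instead of number sign), ESC ( B selects ASCII (with dollar instead of currency sign), ESC ( M selects a character set for African languages, ESC ( ! A selects the Cuban character set, etc. etc.

A 96-character set is designated as Gn character set by an escape sequence ESC - xx (for G1), ESC . xx (for G2) or ESC / xx (for G3). For example, ESC - G selects the Hebrew alphabet as G1.

A multibyte character set is designated as Gn character set by an escape sequence ESC $ xx or ESC $ ( xx (for G0), ESC $ ) xx (for G1), ESC $ * xx (for G2), ESC $ + xx (for G3). For example, ESC $ ( C selects the Korean character set for G0. The Japanese character set selected by ESC $ B has a more recent version selected by ESC & @ ESC $ B.

ISO 4873 stipulates a narrower use of character sets, where G0 is fixed (always ASCII), so that G1, G2 and G3 can only be invoked for codes with the high order bit set. In particular, ^N and ^O are not used anymore, ESC ( xx can be used only with xx=B, and ESC ) xx, ESC * xx, ESC + xx are equivalent to ESC - xx, ESC . xx, ESC / xx, respectively.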

SEE ALSO

console(4), console_ioctl(4), console_codes(4), ascii(7), iso_8859_1(7), unicode(7), utf-8(7)


NAME

charsets - programmer's view of character sets and internationalization

DESCRIPTION

Linux is an international operating system. Various of its utilities and device drivers (including the console driver) support multilingual character sets including Latin-alphabet letters with diacritical marks, accents, ligatures, and entire non-Latin alphabets including Greek, Cyrillic, Arabic, and Hebrew.

This manual page presents a programmer's-eye view of different character-set standards and how they fit together on Linux. Standards discussed include ASCII, ISO 8859, KOI8-R, Unicode, ISO 2022 and ISO 4873. The primary emphasis is on character sets actually used as locale character sets, not the myriad others that can be found in data from other systems.

A complete list of charsets used in an officially supported locale in glibc 2.2.3 is: ISO-8859-{1,2,3,5,6,7,8,9,13,15}, CP1251, UTF-8, EUC-{KR,JP,TW}, KOI8-{R,U}, GB2312, GB18030, GBK, BIG5, BIG5-HKSCS and TIS-620 (in no particular order). (Romanian may be switching to ISO-8859-16.)

ASCII

ASCII (American Standard Code For Information Interchange) is the original 7-bit character set, originally designed for American English. It is currently described by the ECMA-6 standard.

Various ASCII variants exist that replace the dollar sign with other currency symbols and replace punctuation with non-English alphabetic characters, so as to cover German, French, Spanish and others in 7 bits. All are deprecated; GNU libc doesn't support locales whose character sets aren't true supersets of ASCII. (These sets are also known as ISO-646, a close relative of ASCII that permitted replacing these characters.)

As Linux was written for hardware designed in the US, it natively supports ASCII.

ISO 8859

ISO 8859 is a series of 15 8-bit character sets all of which have US ASCII in their low (7-bit) half, invisible control characters in positions 128 to 159, and 96 fixed-width graphics in positions 160-255.

Of these, the most important is ISO 8859-1 (Latin-1). It is natively supported in the Linux console driver, fairly well supported in X11R6, and is the base character set of HTML.

Console support for the other 8859 character sets is available under Linux through user-mode utilities (such as setfont(8)) that modify keyboard bindings and the EGA graphics table and employ the "user mapping" font table in the console driver.

Here are brief descriptions of each set:

8859-1 (Latin-1): Latin-1 covers most Western European languages such as Albanian, Catalan, Danish, Dutch, English, Faroese, Finnish, French, German, Galician, Irish, Icelandic, Italian, Norwegian, Portuguese, Spanish, and Swedish. The lack of the ligatures Dutch ij, French oe and old-style ,,German`` quotation marks is considered tolerable.
8859-2 (Latin-2): Latin-2 supports most Latin-written Slavic and Central European languages: Croatian, Czech, German, Hungarian, Polish, Rumanian, Slovak, and Slovene.
8859-3 (Latin-3): Latin-3 is popular with authors of Esperanto, Galician, and Maltese. (Turkish is now written with 8859-9 instead.)
8859-4 (Latin-4): Latin-4 introduced letters for Estonian, Latvian, and Lithuanian. It is essentially obsolete; see 8859-10 (Latin-6) and 8859-13 (Latin-7).
8859-5: Cyrillic letters supporting Bulgarian, Byelorussian, Macedonian, Russian, Serbian and Ukrainian. Ukrainians read the letter `ghe' with downstroke as `heh' and would need a ghe with upstroke to write a correct ghe. See the discussion of KOI8-R below.
8859-6: Supports Arabic. The 8859-6 glyph table is a fixed font of separate letter forms, but a proper display engine should combine these using the proper initial, medial, and final forms.
8859-7: Supports Modern Greek.
8859-8: Supports modern Hebrew without niqud (punctuation signs). Niqud and full-fledged Biblical Hebrew are outside the scope of this character set; under Linux, UTF-8 is the preferred encoding for these.
8859-9 (Latin-5): This is a variant of Latin-1 that replaces Icelandic letters with Turkish ones.
8859-10 (Latin-6): Latin 6 adds the last Inuit (Greenlandic) and Sami (Lappish) letters that were missing in Latin 4 to cover the entire Nordic area. RFC 1345 listed a preliminary and different `latin6'. Skolt Sami still needs a few more accents than these.
8859-11: This only exists as a rejected draft standard. The draft standard was identical to TIS-620, which is used under Linux for Thai.
8859-12: This set does not exist. While Vietnamese has been suggested for this space, it does not fit within the 96 (non-combining) characters ISO 8859 offers. UTF-8 is the preferred character set for Vietnamese use under Linux.
8859-13 (Latin-7): Supports the Baltic Rim languages; in particular, it includes Latvian characters not found in Latin-4.
8859-14 (Latin-8): This is the Celtic character set, covering Gaelic and Welsh. This charset also contains the dotted characters needed for Old Irish.
8859-15 (Latin-9): This adds the Euro sign and French and Finnish letters that were missing in Latin-1.
8859-16 (Latin-10): This set covers many of the languages covered by 8859-2, and supports Romanian more completely than that set does.
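
The way the 8859 parts share their ASCII half and differ only in positions 160-255 can be illustrated with a short Python sketch. It is an illustration only and assumes Python's standard codec names (latin_1, iso8859_2, iso8859_9, iso8859_15), which are not part of this manual page.

```python
# Decode the same byte value under several ISO 8859 parts.
# The codec names are Python's, an assumption of this sketch.

def decode_with(byte_value, codec):
    """Return the character a single byte maps to under the given codec."""
    return bytes([byte_value]).decode(codec)

# The low (7-bit) half is ASCII in every part.
ascii_same = all(
    decode_with(0x41, codec) == "A"
    for codec in ("latin_1", "iso8859_2", "iso8859_9", "iso8859_15")
)

# Position 0xA4: generic currency sign in Latin-1, Euro sign in 8859-15.
currency = decode_with(0xA4, "latin_1")
euro = decode_with(0xA4, "iso8859_15")

# Position 0xF0: Icelandic eth in Latin-1, Turkish g with breve in 8859-9.
eth = decode_with(0xF0, "latin_1")
g_breve = decode_with(0xF0, "iso8859_9")

print(ascii_same, currency, euro, eth, g_breve)
```

The same byte therefore means different things depending on which part is assumed, which is why the character set has to be known out of band.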

KOI8-R

KOI8-R is a non-ISO character set popular in Russia. The lower half is US ASCII; the upper is a Cyrillic character set somewhat better designed than ISO 8859-5. KOI8-U is a common character set, based on KOI8-R, that has better support for Ukrainian. Neither of these sets is ISO-2022 compatible, unlike the ISO-8859 series.

Console support for KOI8-R is available under Linux through user-mode utilities that modify keyboard bindings and the EGA graphics table, and employ the "user mapping" font table in the console driver.

JIS X 0208

JIS X 0208 is a Japanese national standard character set. Though there are some more Japanese national standard character sets (like JIS X 0201, JIS X 0212, and JIS X 0213), this is the most important one. Characters are mapped into a 94x94 two-byte matrix, each byte of which is in the range 0x21-0x7e. Note that JIS X 0208 is a character set, not an encoding. This means that JIS X 0208 itself is not used for expressing text data. JIS X 0208 is used as a component to construct encodings such as EUC-JP, Shift_JIS, and ISO-2022-JP. EUC-JP is the most important encoding for Linux and includes US ASCII and JIS X 0208. In EUC-JP, JIS X 0208 characters are expressed in two bytes, each of which is the JIS X 0208 code plus 0x80.
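
The "JIS X 0208 code plus 0x80" relationship can be checked with a small Python sketch. The helper name jis_from_euc_jp is hypothetical, and the euc_jp codec is Python's, not something this page defines.

```python
# Recover the JIS X 0208 code of a character from its EUC-JP bytes.
# jis_from_euc_jp is a hypothetical helper written for this illustration.

def jis_from_euc_jp(ch):
    """Return the two-byte JIS X 0208 code of ch, derived from EUC-JP."""
    euc = ch.encode("euc_jp")
    # JIS X 0208 characters are exactly two bytes, both in 0xA1-0xFE.
    if len(euc) != 2 or not all(0xA1 <= b <= 0xFE for b in euc):
        raise ValueError("not a JIS X 0208 character in EUC-JP")
    return bytes(b - 0x80 for b in euc)

kanji = "\u65e5"                      # the character "sun/day"
euc = kanji.encode("euc_jp")
jis = jis_from_euc_jp(kanji)
print(euc.hex(), jis.hex())

# ASCII passes through EUC-JP unchanged.
print("A".encode("euc_jp"))
```

Both bytes of the recovered code fall in 0x21-0x7e, matching the 94x94 matrix described above.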

KS X 1001

KS X 1001 is a Korean national standard character set. Just like JIS X 0208, characters are mapped into a 94x94 two-byte matrix. KS X 1001 is used like JIS X 0208, as a component to construct encodings such as EUC-KR, Johab, and ISO-2022-KR. EUC-KR is the most important encoding for Linux and includes US ASCII and KS X 1001. KS C 5601 is an older name for KS X 1001.

GB 2312

GB 2312 is a mainland Chinese national standard character set used to express simplified Chinese. Just like JIS X 0208, characters are mapped into a 94x94 two-byte matrix used to construct EUC-CN. EUC-CN is the most important encoding for Linux and includes US ASCII and GB 2312. Note that EUC-CN is often called GB, GB 2312, or CN-GB.

Big5

Big5 is a popular character set in Taiwan to express traditional Chinese. (Big5 is both a character set and an encoding.) It is a superset of US ASCII. Non-ASCII characters are expressed in two bytes. Bytes 0xa1-0xfe are used as leading bytes for two-byte characters. Big5 and its extensions are widely used in Taiwan and Hong Kong. It is not ISO 2022-compliant.

TIS 620

TIS 620 is a Thai national standard character set and a superset of US ASCII. As in the ISO 8859 series, Thai characters are mapped into 0xa1-0xfe. TIS 620 is the only commonly used character set under Linux besides UTF-8 to have combining characters.

UNICODE

Unicode (ISO 10646) is a standard which aims to unambiguously represent every character in every human language. Unicode's structure permits about 20.1 bits to encode every character (code points run from 0 to 0x10FFFF). Since most computers don't include 20.1-bit integers, Unicode is usually encoded as 32-bit integers internally and either a series of 16-bit integers (UTF-16) (needing two 16-bit integers only when encoding certain rare characters) or a series of 8-bit bytes (UTF-8). Information on Unicode is available at <http://www.unicode.com>.

Linux represents Unicode using the 8-bit Unicode Transformation Format (UTF-8). UTF-8 is a variable length encoding of Unicode. It uses 1 byte to code 7 bits, 2 bytes for 11 bits, 3 bytes for 16 bits, 4 bytes for 21 bits, 5 bytes for 26 bits, 6 bytes for 31 bits.

Let 0,1,x stand for a zero, one, or arbitrary bit. A byte 0xxxxxxx stands for the Unicode 00000000 0xxxxxxx which codes the same symbol as the ASCII 0xxxxxxx. Thus, ASCII goes unchanged into UTF-8, and people using only ASCII do not notice any change: not in code, and not in file size.

A byte 110xxxxx is the start of a 2-byte code, and 110xxxxx 10yyyyyy is assembled into 00000xxx xxyyyyyy. A byte 1110xxxx is the start of a 3-byte code, and 1110xxxx 10yyyyyy 10zzzzzz is assembled into xxxxyyyy yyzzzzzz. (When UTF-8 is used to code the 31-bit ISO 10646 then this progression continues up to 6-byte codes.)
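
The bit layout described above can be written out by hand. The following Python sketch is only an illustration: utf8_encode is a hypothetical helper, it stops at 4-byte codes (the 5- and 6-byte forms are left out), and it is checked against Python's built-in encoder.

```python
# Hand-written UTF-8 encoder following the bit patterns in the text.
# utf8_encode is a hypothetical name; only 1- to 4-byte codes are covered.

def utf8_encode(cp):
    """Encode one code point (0..0x10FFFF) into UTF-8 bytes."""
    if cp < 0:
        raise ValueError("negative code point")
    if cp < 0x80:            # 7 bits:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:           # 11 bits: 110xxxxx 10yyyyyy
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:         # 16 bits: 1110xxxx 10yyyyyy 10zzzzzz
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:       # 21 bits: 11110xxx followed by three tails
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range for this sketch")

# Compare with the built-in encoder for one sample per length.
samples = [0x41, 0xE9, 0x20AC, 0x1F600]
for cp in samples:
    print(hex(cp), utf8_encode(cp).hex(), chr(cp).encode("utf-8").hex())
```

Every byte after the first has the form 10xxxxxx, which is what makes the head and tail bytes distinguishable.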

For most people who use ISO-8859 character sets, this means that the characters outside of ASCII are now coded with two bytes. This tends to expand ordinary text files by only one or two percent. For Russian or Greek users, this expands ordinary text files by 100%, since text in those languages is mostly outside of ASCII. For Japanese users this means that the 16-bit codes now in common use will take three bytes. While there are algorithmic conversions from some character sets (esp. ISO-8859-1) to Unicode, general conversion requires carrying around conversion tables, which can be quite large for 16-bit codes.
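
The size effect can be measured directly. This Python sketch compares byte counts for two short sample strings; the samples and the helper name utf8_growth are made up for illustration, and real text contains far more ASCII (spaces, digits, punctuation) than these samples do.

```python
# Compare the size of text in a legacy 8-bit charset and in UTF-8.
# utf8_growth and both sample strings are assumptions of this sketch.

def utf8_growth(text, legacy_codec):
    """Return (bytes in the legacy codec, bytes in UTF-8) for text."""
    return len(text.encode(legacy_codec)), len(text.encode("utf-8"))

# German sample: mostly ASCII with three letters outside ASCII.
german = "Gr\u00fc\u00dfe aus K\u00f6ln"
# Greek sample: every letter is outside ASCII.
greek = "\u0391\u03b8\u03ae\u03bd\u03b1"

print(utf8_growth(german, "latin_1"))
print(utf8_growth(greek, "iso8859_7"))
```

In the German sample only the three non-ASCII letters cost an extra byte each, while the Greek sample doubles in size.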

Note that UTF-8 is self-synchronizing: 10xxxxxx is a tail, any other byte is the head of a code. Note that the only way ASCII bytes occur in a UTF-8 stream, is as themselves. In particular, there are no embedded NULs or '/'s that form part of some larger code.
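
Self-synchronization means that a reader dropped at an arbitrary byte offset can find the start of the current code by skipping tail bytes. A minimal Python sketch, with head_index as a hypothetical helper name:

```python
# Find the head byte of the UTF-8 code that contains a given offset.
# head_index is a hypothetical helper written for this illustration.

def head_index(buf, i):
    """Step back from offset i while the byte is a tail (10xxxxxx)."""
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return i

# 'a', the Euro sign (3 bytes), 'b', '/'
data = "a\u20acb/".encode("utf-8")
heads = [head_index(data, i) for i in range(len(data))]
print(heads)

# No byte of a multibyte code looks like ASCII, so NUL and '/' never
# appear inside one.
euro_bytes = "\u20ac".encode("utf-8")
print(all(b >= 0x80 for b in euro_bytes))
```

Offsets 1, 2 and 3 all resolve to offset 1, the head of the three-byte code, while the ASCII bytes resolve to themselves.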

Since ASCII, and, in particular, NUL and '/', are unchanged, the kernel does not notice that UTF-8 is being used. It does not care at all what the bytes it is handling stand for.

Rendering of Unicode data streams is typically handled through `subfont' tables which map a subset of Unicode to glyphs. Internally the kernel uses Unicode to describe the subfont loaded in video RAM. This means that in UTF-8 mode one can use a character set with 512 different symbols. This is not enough for Japanese, Chinese and Korean, but it is enough for most other purposes.

At the current time, the console driver does not handle combining characters. So Thai, Sioux and any other script needing combining characters can't be handled on the console.

ISO 2022 AND ISO 4873

The ISO 2022 and 4873 standards describe a font-control model based on VT100 practice. This model is (partially) supported by the Linux kernel and by xterm(1). It is popular in Japan and Korea.

There are 4 graphic character sets, called G0, G1, G2 and G3, and one of them is the current character set for codes with high bit zero (initially G0), and one of them is the current character set for codes with high bit one (initially G1). Each graphic character set has 94 or 96 characters, and is essentially a 7-bit character set. It uses codes either 040-0177 (041-0176) or 0240-0377 (0241-0376). G0 always has size 94 and uses codes 041-0176.

Switching between character sets is done using the shift functions ^N (SO or LS1), ^O (SI or LS0), ESC n (LS2), ESC o (LS3), ESC N (SS2), ESC O (SS3), ESC ~ (LS1R), ESC } (LS2R), ESC | (LS3R). The function LSn makes character set Gn the current one for codes with high bit zero. The function LSnR makes character set Gn the current one for codes with high bit one. The function SSn makes character set Gn (n=2 or 3) the current one for the next character only (regardless of the value of its high order bit).

A 94-character set is designated as Gn character set by an escape sequence ESC ( xx (for G0), ESC ) xx (for G1), ESC * xx (for G2), ESC + xx (for G3), where xx is a symbol or a pair of symbols found in the ISO 2375 International Register of Coded Character Sets. For example, ESC ( @ selects the ISO 646 character set as G0, ESC ( A selects the UK standard character set (with pound instead of number sign), ESC ( B selects ASCII (with dollar instead of currency sign), ESC ( M selects a character set for African languages, ESC ( ! A selects the Cuban character set, etc. etc.

A 96-character set is designated as Gn character set by an escape sequence ESC - xx (for G1), ESC . xx (for G2) or ESC / xx (for G3). For example, ESC - G selects the Hebrew alphabet as G1.

A multibyte character set is designated as Gn character set by an escape sequence ESC $ xx or ESC $ ( xx (for G0), ESC $ ) xx (for G1), ESC $ * xx (for G2), ESC $ + xx (for G3). For example, ESC $ ( C selects the Korean character set for G0. The Japanese character set selected by ESC $ B has a more recent version selected by ESC & @ ESC $ B.
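
The ESC $ B designation can be observed in the ISO-2022-JP encoding. The sketch below uses Python's iso2022_jp codec, which is an assumption of this example and not something this page describes; it only shows the escape sequences surrounding the two-byte codes.

```python
# Show the ISO 2022 escape sequences emitted around Japanese text.
# Uses Python's iso2022_jp codec (an assumption of this sketch).

ESC = b"\x1b"

text = "A\u65e5\u672cB"              # 'A', two kanji, 'B'
encoded = text.encode("iso2022_jp")
print(encoded)

# ESC $ B designates the Japanese multibyte set as G0.
switch_to_jis = ESC + b"$B"
# ESC ( B designates ASCII as G0 again.
switch_to_ascii = ESC + b"(B"

print(switch_to_jis in encoded, switch_to_ascii in encoded)

# Every byte stays in the 7-bit range, as ISO 2022 G0 sets require.
print(all(b < 0x80 for b in encoded))
```

Between the two escape sequences each kanji is carried as two 7-bit bytes in the range 0x21-0x7e.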

ISO 4873 stipulates a narrower use of character sets, where G0 is fixed (always ASCII), so that G1, G2 and G3 can only be invoked for codes with the high order bit set. In particular, ^N and ^O are not used anymore, ESC ( xx can be used only with xx=B, and ESC ) xx, ESC * xx, ESC + xx are equivalent to ESC - xx, ESC . xx, ESC / xx, respectively.

SEE ALSO

console(4), console_ioctl(4), console_codes(4), ascii(7), iso_8859-1(7), unicode(7), utf-8(7)

Editor: Han Yashan. Source: CMPP.net
