Last updated 2004-07-02 by Roedy Green
©1996-2004 Canadian Mind Products
However, encodings are more versatile than that. They also let you read and write big-endian or little-endian 16-bit Unicode character streams. In theory, encodings could support complex encoding structures, translation, compression, or quoting. One letter may become many or vice versa. Letters may be suppressed.
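To make the big-endian versus little-endian point concrete, here is a minimal sketch that dumps the bytes Java produces for the single letter A under three of the 16-bit encodings. It assumes a JDK recent enough to have java.nio.charset.StandardCharsets (Java 7 or later); the class name EndianDemo and the hex helper are mine, purely for illustration.

```java
import java.nio.charset.StandardCharsets;

public class EndianDemo {
    /** Render bytes as space-separated upper-case hex, e.g. "00 41". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "A"; // code point U+0041
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE))); // 00 41
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16LE))); // 41 00
        // Plain UTF-16 writes a big-endian byte order mark first.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16)));   // FE FF 00 41
    }
}
```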
Encodings are usually trap doors: the translation only works cleanly in one direction. When you translate to 8-bit, you lose information. When you translate back to Unicode, some characters will not come back the way they were originally. Some may even be missing.
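The trap-door effect is easy to see for yourself. This sketch pushes a string through 8-bit ISO-8859-1 and back. It assumes a JDK with StandardCharsets (Java 7 or later); the class name TrapDoorDemo and the method roundTrip are hypothetical names chosen for the example.

```java
import java.nio.charset.StandardCharsets;

public class TrapDoorDemo {
    /** Encode to 8-bit ISO-8859-1, then decode the bytes back to a String. */
    static String roundTrip(String s) {
        byte[] eightBit = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(eightBit, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "cafe" with an e-acute (in Latin-1), then a euro sign (not in Latin-1) and a 5.
        String original = "caf\u00e9 \u20ac5";
        String back = roundTrip(original);
        // The e-acute survives; the euro sign was replaced by '?' on the way out.
        System.out.println(original.equals(back));         // false
        System.out.println(back.equals("caf\u00e9 ?5"));   // true
    }
}
```

String.getBytes quietly substitutes a replacement byte for anything the target encoding cannot represent, which is why the loss goes unnoticed unless you check for it.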
The complete set of which encodings are supported is not documented, nor is there a documented way to find out just which ones are supported on your JVM. However, there are four sources of information:
| Encoding name | Supported? | Description |
|---|---|---|
| 8859_1 | Y | Latin-1 ASCII (the USA default). This just takes the low order 8 bits and tacks on a high order 0 byte. Same as ISO8859_1. Microsoft's variant of Latin-1 is called Cp1252. |
| ASCII | Y | 7 bit ASCII, plus forms like \uxxxx for the exotic characters. |
| base64 | N | base64 source code is available. |
| base64u | N | base64u source code is available. A variant of Base64 that is also URL-encoded. |
| base85 | N | |
| Big5 | Y | Big5, Traditional Chinese |
| Big5_HKSCS | Y | Big5 with Hong Kong extensions, Traditional Chinese |
| Big5_Solaris | Y | Big5 with seven additional Hanzi ideograph character mappings for the Solaris zh_TW.BIG5 locale |
| Cp037 | Y | USA, Canada (Bilingual, French), Netherlands, Portugal, Brazil, Australia, aka Cp1140 |
| Cp273 | Y | IBM Austria, Germany, aka Cp1141 |
| Cp277 | Y | IBM Denmark, Norway, aka Cp1142 |
| Cp278 | Y | IBM Finland, Sweden, aka Cp1143 |
| Cp280 | Y | IBM Italy, aka Cp1144 |
| Cp284 | Y | IBM Catalan/Spain, Spanish Latin America, aka Cp1145 |
| Cp285 | Y | IBM United Kingdom, Ireland, aka Cp1146 |
| Cp297 | Y | IBM France, aka Cp1147 |
| Cp420 | Y | IBM Arabic |
| Cp424 | Y | IBM Hebrew |
| Cp437 | Y | Original IBM PC OEM DOS character set (with line drawing characters and some Greek and math), MS-DOS United States, Australia, New Zealand, South Africa. The rest of the world uses Cp850 for the DOS box. |
| Cp500 | Y | EBCDIC 500V1, aka Cp1148 |
| Cp737 | Y | PC Greek |
| Cp775 | Y | PC Baltic |
| Cp838 | Y | IBM Thailand extended SBCS |
| Cp850 | Y | Microsoft DOS Multilingual Latin-1 (with line drawing characters). For true Latin-1 see 8859_1. See Cp437. |
| Cp852 | Y | Microsoft DOS Multilingual Latin-2 Slavic |
| Cp855 | Y | IBM Cyrillic |
| Cp857 | Y | IBM Turkish |
| Cp858 | Y | variant of Cp850 with the Euro. Microsoft DOS Multilingual Latin-1 (with line drawing characters). For true Latin-1 see 8859_1. |
| Cp860 | Y | MS-DOS Portuguese |
| Cp861 | Y | MS-DOS Icelandic |
| Cp862 | Y | PC Hebrew |
| Cp863 | Y | MS-DOS Canadian French |
| Cp864 | Y | PC Arabic |
| Cp865 | Y | MS-DOS Nordic |
| Cp866 | Y | MS-DOS Russian |
| Cp868 | Y | MS-DOS Pakistan |
| Cp869 | Y | IBM Modern Greek |
| Cp870 | Y | IBM Multilingual Latin-2 |
| Cp871 | Y | IBM Iceland, aka Cp1149 |
| Cp874 | Y | IBM Thai |
| Cp875 | Y | IBM Greek |
| Cp918 | Y | IBM Pakistan(Urdu) |
| Cp921 | Y | IBM Latvia, Lithuania (AIX, DOS) |
| Cp922 | Y | IBM Estonia (AIX, DOS) |
| Cp930 | Y | Japanese Katakana-Kanji mixed with 4370 UDC, superset of 5026 |
| Cp933 | Y | Korean Mixed with 1880 UDC, superset of 5029 |
| Cp935 | Y | Simplified Chinese Host mixed with 1880 UDC, superset of 5031 |
| Cp937 | Y | Traditional Chinese Host mixed with 6204 UDC, superset of 5033 |
| Cp939 | Y | Japanese Latin Kanji mixed with 4370 UDC, superset of 5035 |
| Cp942 | Y | Japanese (OS/2) superset of 932 |
| Cp942C | Y | variant of Cp942. Japanese (OS/2) superset of Cp932 |
| Cp943 | Y | Japanese (OS/2) superset of Cp932 and Shift-JIS. |
| Cp943C | Y | Variant of Cp943. Japanese (OS/2) superset of Cp932 and Shift-JIS. |
| Cp948 | Y | OS/2 Chinese (Taiwan) superset of 938 |
| Cp949 | Y | PC Korean |
| Cp949C | Y | variant of Cp949, PC Korean |
| Cp950 | Y | PC Chinese (Hong Kong, Taiwan) |
| Cp964 | Y | AIX Chinese (Taiwan) |
| Cp970 | Y | AIX Korean |
| Cp1006 | Y | IBM AIX Pakistan (Urdu) |
| Cp1025 | Y | IBM Multilingual Cyrillic: Bulgaria, Bosnia, Herzegovina, Macedonia (FYR) |
| Cp1026 | Y | IBM Latin-5, Turkey |
| Cp1046 | Y | IBM Open Edition US EBCDIC |
| Cp1047 | Y | IBM System 390 EBCDIC, Java 1.2 only. |
| Cp1097 | Y | IBM Iran(Farsi)/Persian |
| Cp1098 | Y | IBM Iran(Farsi)/Persian (PC) |
| Cp1112 | Y | IBM Latvia, Lithuania |
| Cp1122 | Y | IBM Estonia |
| Cp1123 | Y | IBM Ukraine |
| Cp1124 | Y | IBM AIX Ukraine |
| Cp1140 | Y | USA, Canada (Bilingual, French), Netherlands, Portugal, Brazil, Australia |
| Cp1141 | Y | IBM Austria, Germany |
| Cp1142 | Y | IBM Denmark, Norway |
| Cp1143 | Y | IBM Finland, Sweden |
| Cp1144 | Y | IBM Italy |
| Cp1145 | Y | IBM Catalan/Spain, Spanish Latin America |
| Cp1146 | Y | IBM United Kingdom, Ireland |
| Cp1147 | Y | IBM France |
| Cp1148 | Y | EBCDIC 500V1 |
| Cp1149 | Y | IBM Iceland |
| Cp1250 | Y | Windows Eastern European |
| Cp1251 | Y | Windows Cyrillic |
| Cp1252 | Y | Microsoft Windows variant of Latin-1, NT default. Beware. Some unexpected translations occur when you read with this default encoding, e.g. codes 128..159 are translated to 16 bit chars with bits in the high order byte on. It does not just truncate the high byte on write and pad with 0 on read. For true Latin-1 see 8859_1. |
| Cp1253 | Y | Windows Greek |
| Cp1254 | Y | Windows Turkish |
| Cp1255 | Y | Windows Hebrew |
| Cp1256 | Y | Windows Arabic |
| Cp1257 | Y | Windows Baltic |
| Cp1258 | Y | Windows Vietnamese |
| Cp1381 | Y | IBM OS/2, DOS People's Republic of China (PRC) |
| Cp1383 | Y | IBM AIX People's Republic of China (PRC) |
| Cp33722 | Y | IBM-eucJP - Japanese (superset of 5050) |
| Default | Y | 7-bit ASCII (not the actual default!). Strips off the high order bit 7 and tacks on a high order 0 byte. The actual default is controlled in Windows 95/98/ME/NT/2000/XP by the control panel national settings. |
| EUC_CN | Y | GB2312, EUC encoding, Simplified Chinese |
| EUC_JP | Y | JIS0201, 0208, 0212, EUC Encoding, Japanese |
| EUC_JP_LINUX | Y | JISX0201, 0208, EUC Encoding, Japanese for Linux |
| EUC_KR | Y | KS C 5601, EUC Encoding, Korean |
| EUC_TW | Y | CNS11643 (Plane 1-3), T. Chinese, EUC encoding |
| GB18030 | Y | Simplified Chinese, PRC standard |
| GB2312 | Y | Chinese. Popular in email. |
| GBK | Y | GBK, Simplified Chinese |
| IBMOEM | N | |
| ISCII91 | Y | ISCII91 encoding of Indic scripts |
| ISO2022CN | Y | ISO 2022 CN, Chinese |
| ISO2022CN_CNS | Y | CNS 11643 in ISO-2022-CN form, T. Chinese |
| ISO2022CN_GB | Y | GB 2312 in ISO-2022-CN form, S. Chinese |
| ISO2022JP | Y | JIS0201, 0208, 0212, ISO2022 Encoding, Japanese |
| ISO2022KR | Y | ISO 2022 KR, Korean |
| ISO8859_1 | Y | ISO 8859-1, same as 8859_1, USA, Europe, Latin America, Caribbean, Canada, Africa, Latin-1, (Danish, Dutch, English, Faeroese, Finnish, French, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish and Swedish). Beware, for NT, the default is Cp1252 a variant of Latin-1, controlled by the control panel regional settings. |
| ISO8859_2 | Y | ISO 8859-2, Eastern Europe, Latin-2, (Albanian, Czech, English, German, Hungarian, Polish, Rumanian, (Serbo-)Croatian, Slovak, Slovene and Swedish) |
| ISO8859_3 | Y | ISO 8859-3, SE Europe/miscellaneous, Latin-3 (Afrikaans, Catalan, English, …) |
| ISO8859_4 | Y | ISO 8859-4, Scandinavia/Baltic, Latin-4, (Danish, English, Estonian, Finnish, German, Greenlandic, Lappish, Latvian, Lithuanian, Norwegian and Swedish) |
| ISO8859_5 | Y | ISO 8859-5, Cyrillic, (Bulgarian, Bielorussian, English, Macedonian, Russian, Serb(o-Croat)ian and Ukrainian) |
| ISO8859_6 | Y | ISO 8859-6, Arabic ASMO 449 |
| ISO8859_7 | Y | ISO 8859-7, Greek ELOT-928 |
| ISO8859_8 | Y | ISO 8859-8, Hebrew |
| ISO8859_9 | Y | ISO 8859-9, Turkish Latin-5, (English, Finnish, French, German, Irish, Italian, Norwegian, Portuguese, Spanish and Swedish and Turkish) |
| ISO8859_10 | N | ISO 8859-10, Lappish/Nordic/Eskimo languages, Latin-6. (Danish, English, Estonian, Faeroese, Finnish, German, Greenlandic, Icelandic, Lappish, Latvian, Lithuanian, Norwegian and Swedish) |
| ISO8859_11 | N | ISO 8859-11, Thai. |
| ISO8859_12 | N | ISO 8859-12, Devanagari. |
| ISO8859_13 | N | ISO 8859-13, Baltic Rim, Latin-7. |
| ISO8859_14 | N | ISO 8859-14, Celtic, Latin-8. |
| ISO8859_15 | Y | ISO 8859-15, Euro, including Euro currency sign, aka Latin9, not Latin-15 as you would expect. Like Latin-1 with 8 replacements. |
| JIS | Y | Japanese |
| JIS0201 | Y | JIS 0201, Japanese |
| JIS0208 | Y | Japanese |
| JIS0208 | Y | JIS 0208, Japanese |
| JIS0212 | Y | JIS 0212, Japanese |
| JISAutoDetect | Y | Detects and converts from Shift-JIS, EUC-JP, ISO 2022 JP (conversion to Unicode only) |
| Johab | Y | Johab, Korean |
| KOI8_R | Y | KOI8-R, Russian |
| ks_c_5601-1987 | N | Korean standard often used in emails. See KSC5601. |
| KSC5601 | Y | Korean |
| Latin-1 | N | see 8859_1 and Cp1252. |
| Latin-2 | N | see 8859_2. |
| Latin-3 | N | see 8859_3. |
| Latin-4 | N | see 8859_4. |
| Latin Extended-A | N | MSWord |
| Latin Extended-B | N | MSWord |
| LocaleDefault | N | Mad as it sounds, the only way to get this is to look up the Locale default yourself, such as with System.getProperty( "file.encoding" ), and pass it explicitly, or to use a variant method that does not specify the encoding. Default won't do it! In my opinion, all methods that use a LocaleDefault without an encoding parameter should be deprecated. |
| MacArabic | Y | Macintosh Arabic |
| MacCentralEurope | Y | Macintosh Latin-2 |
| MacCroatian | Y | Macintosh Croatian |
| MacCyrillic | Y | Macintosh Cyrillic |
| MacDingbat | Y | Macintosh Dingbat |
| MacGreek | Y | Macintosh Greek |
| MacHebrew | Y | Macintosh Hebrew |
| MacIceland | Y | Macintosh Iceland |
| MacRoman | Y | Macintosh Roman |
| MacRomania | Y | Macintosh Romania |
| MacSymbol | Y | Macintosh Symbol |
| MacThai | Y | Macintosh Thai |
| MacTurkish | Y | Macintosh Turkish |
| MacUkraine | Y | Macintosh Ukraine |
| MS874 | Y | Windows Thai |
| MS932 | Y | Windows Japanese. Microsoft JIS. |
| MS936 | Y | Windows Simplified Chinese PRC |
| MS949 | Y | Windows Korean |
| MS950 | Y | Windows Traditional Chinese |
| MS950_HKSCS | Y | Windows Traditional Chinese with Hong Kong extensions |
| SingleByte | Y | This does not expand low order eight-bits with high order zero as its name implies. It looks to be a complex encoding for some Asian language. |
| SJIS | Y | Shift JIS. Japanese. A Microsoft code that extends csHalfWidthKatakana to include kanji by adding a second byte when the value of the first byte is in the ranges 81-9F or E0-EF. |
| TIS620 | Y | TIS620, Thai |
| truncation | N | chop high byte, or 0-pad high byte. |
| Unicode | Y | Same as UnicodeBig. |
| UnicodeBig | Y | 16-bit UCS-2 Transformation Format, big endian byte order identified by an optional byte-order mark; FE FF. On read, defaults to big-endian. On write puts out a big-endian marker. Same as Unicode. UTF-16BE uses the same byte order but without the marker. |
| UnicodeBigUnmarked | Y | 16-bit UCS-2 Transformation Format, big endian byte order, definitely without Byte Order Mark. Not written on write, ignored on read. |
| UnicodeLittle | Y | 16-bit UCS-2 Transformation Format, little endian byte order identified by an optional byte-order mark; FF FE. On read, defaults to little-endian. On write puts out a little-endian marker. UTF-16LE uses the same byte order but without the marker. |
| UnicodeLittleUnmarked | Y | 16-bit UCS-2 Transformation Format, little endian byte order, definitely without Byte Order Mark. Not written on write, ignored on read. |
| URL | N | For x-www-form-urlencoded use java.net.URLEncoder.encode and java.net.URLDecoder.decode instead. Used to encode CGI command lines. It encodes space as + and special characters as %xx hex. Don't confuse it with BASE64 or BASE64u. |
| US-ASCII | Y | 7-bit American Standard Code for Information Interchange. |
| Uuencode | N | Similar to base64. |
| UTF-7 | N | 7-bit encoded Unicode. |
| UTF-8 | Y | 8-bit encoded Unicode, née UTF8. Optional marker on front of file: EF BB BF. DataOutputStreams have a binary length count in front of each string. Endianness does not apply to 8-bit encodings. |
| UTF-16 | Y | Same as Unicode. |
| UTF-16BE | Y | Same as UnicodeBigUnmarked. |
| UTF-16LE | Y | Same as UnicodeLittleUnmarked. |
| UTF-32 | N | 32-bit UCS-4 Transformation Format, byte order identified by an optional byte-order mark: FF FE 00 00 for little endian, 00 00 FE FF for big endian. |
| UTF-32BE | N | 32-bit UCS-4 Transformation Format, big-endian byte order. The 32-bit counterpart of UnicodeBigUnmarked. |
| UTF-32LE | N | 32-bit UCS-4 Transformation Format, little-endian byte order. The 32-bit counterpart of UnicodeLittleUnmarked. |
| the wrapper | N | wrapper source code is available. A variant of Base64u that is also URL-encoded. It also optionally handles serialization/reconstituting, compression/decompression, signing/verifying and heavy duty encryption/decryption. |
Many new encodings were added in Java 1.4.1 and some were dropped. This list contains even the dropped items. Beware, this list is not complete. Mainly it is missing all the new Windows and IBM proprietary encodings. Before you use an encoding, make sure it is supported by your version of Java.
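One way to make that check from inside a program, on JVMs that have the java.nio.charset package (added in Java 1.4), is to ask the Charset class. This is only a sketch; the class name EncodingCheck and the helper isUsable are made up for the example, and which names come back supported depends entirely on your JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

public class EncodingCheck {
    /** True if this JVM can resolve the given encoding name or alias. */
    static boolean isUsable(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            // Names with illegal characters, such as embedded spaces, are rejected outright.
            return false;
        }
    }

    public static void main(String[] args) {
        String[] candidates = {"UTF-8", "ISO8859_1", "Cp437", "base64", "Latin Extended-A"};
        for (String name : candidates) {
            System.out.println(name + " supported: " + isUsable(name));
        }
        // Canonical names of every charset this particular JVM offers.
        for (String canonical : Charset.availableCharsets().keySet()) {
            System.out.println(canonical);
        }
    }
}
```

Note that availableCharsets reports canonical names, which are often not the same spellings as the historical names in the table above.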
Note that what Java and the HTML 4.0 specification call a "character encoding" is actually called a "character set" at IANA and in the HTTP proposed standard.
Here is how you would take a file in the old DOS IBMOEM encoding and bring it up to UTF-8 snuff for posting on the web.
Here is how you would take an UTF-8 file and convert it back to native format, e.g. Windows NT.
You can have a look at a registry dump this way:

    regedit /E java.reg "HKEY_LOCAL_MACHINE\SOFTWARE\JavaSoft"
    native2ascii -encoding UnicodeLittle java.reg java.asc
You can convert between any two encodings by going in two steps, via printable. Someone should probably write an improved version of this little utility that can convert from anything to anything in one step and that will put the output back on top of the input file by default. You do this by creating a temporary output file in the same directory as the input file, then renaming when safely done.
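Here is a rough sketch of what such a one-step converter might look like, using the temporary-file-then-rename approach just described. It is not the utility discussed above, only an illustration. It assumes Java 7 or later for java.nio.file; the class name Transcode and the method inPlace are hypothetical.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class Transcode {
    /** Rewrite a file in place, decoding with one encoding and re-encoding with another. */
    static void inPlace(Path file, Charset from, Charset to) throws IOException {
        // Put the temporary file in the same directory so the final rename stays on one volume.
        Path dir = file.toAbsolutePath().getParent();
        Path temp = Files.createTempFile(dir, "transcode", ".tmp");
        try {
            try (BufferedReader in = Files.newBufferedReader(file, from);
                 BufferedWriter out = Files.newBufferedWriter(temp, to)) {
                char[] buffer = new char[8192];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
            }
            // Only replace the original once the whole conversion has succeeded.
            Files.move(temp, file, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // Cleans up after a failed conversion; does nothing after a successful move.
            Files.deleteIfExists(temp);
        }
    }

    public static void main(String[] args) throws IOException {
        // usage: java Transcode file fromEncoding toEncoding
        inPlace(Paths.get(args[0]), Charset.forName(args[1]), Charset.forName(args[2]));
    }
}
```

You might run it as java Transcode old.txt Cp437 UTF-8, provided your JVM supports both names. The readers and writers obtained from Files throw an exception on malformed or unmappable characters instead of silently substituting, so a bad conversion leaves the original file untouched.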
Further, the encode/decode routines are permitted to combine pairs such as 0x0055 (LATIN CAPITAL LETTER U) followed by 0x0308 (COMBINING DIAERESIS) to a single character 0x00DC (LATIN CAPITAL LETTER U WITH DIAERESIS), or vice versa.
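The equivalence between the two forms can be seen with java.text.Normalizer, which appeared in Java 6. To be clear, this is explicit Unicode normalization rather than anything an encoder does for you; I use it here only to show the composed and decomposed spellings of the same letter. The class name ComposeDemo and its two helpers are invented for the example.

```java
import java.text.Normalizer;

public class ComposeDemo {
    /** Combine base letters and combining marks into single characters where possible (NFC). */
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Split precomposed characters into base letter plus combining marks (NFD). */
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String pair = "U\u0308"; // LATIN CAPITAL LETTER U followed by COMBINING DIAERESIS
        String single = compose(pair);
        System.out.println(single.length());              // 1
        System.out.println(single.equals("\u00DC"));      // true
        System.out.println(decompose(single).equals(pair)); // true
    }
}
```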
    /**
     * detect chars that are translated by an encoding
     * by Jon Skeet
     * pass name of an encoding you want to test on the command line
     * e.g. 8859_1 ISO8859_4 Cp437
     */
    public class EncodingTest
    {
        public static void main( String[] args ) throws Exception
        {
            String encoding = args[0];
            byte[] b = new byte[256];
            for ( int i=0; i<256; i++ )
            {
                b[i] = (byte)i;
            }
            String x = new String( b, encoding );
            for ( int i=0; i<x.length(); i++ )
            {
                if ( x.charAt( i ) != i )
                {
                    System.out.println( i + " -> " + (int)x.charAt( i ) );
                }
            }
        }
    }
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
The Unicode little-endian or big-endian BOM (Byte Order Mark) is a strong clue you have 16-bit Unicode.
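Checking for that clue takes only a few lines. This sketch looks at the first bytes of a file and reports which encoding a byte order mark suggests. The class name BomSniffer and the method guessFromBom are hypothetical, and the sketch deliberately ignores UTF-32, whose little-endian mark also begins FF FE.

```java
public class BomSniffer {
    /** Returns the encoding name suggested by a leading BOM, or null if none is present. */
    static String guessFromBom(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            // Assumption: not UTF-32LE, which would continue with 00 00.
            return "UTF-16LE";
        }
        return null; // no BOM; the encoding has to be guessed some other way
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00};
        System.out.println(guessFromBom(sample)); // UTF-16LE
    }
}
```

The caller still has to skip the mark itself before decoding with the unmarked encoding that is returned.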
To automate the guessing, you could look for common foreign words to see how they are encoded. You could compute letter frequencies and compare them against documents with known encodings.
You might want to tackle this student project to solve the problem.