Java Glossary : encoding

Last updated 2004-07-02 by Roedy Green ©1996-2004 Canadian Mind Products


encoding
Normally Readers translate from various 8-bit byte streams to standard 16-bit Unicode as they read. You can specify the sort of translation to use when you create the Reader. Similarly, Writers normally translate from internal 16-bit Unicode into various 8-bit byte streams.
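As a minimal sketch of naming the translation when you create a Reader or Writer, the class below wraps in-memory byte streams rather than files. The class name ExplicitEncoding and its two helper methods are made up for illustration, and wrapping IOException in UncheckedIOException is just a convenience for the demo.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;

public class ExplicitEncoding {

    /** Decode bytes to a String through a Reader created with a named encoding. */
    public static String decode(byte[] bytes, String encoding) {
        try {
            Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), encoding);
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
            r.close();
            return sb.toString();
        } catch (IOException e) {
            // includes UnsupportedEncodingException for an unknown name
            throw new UncheckedIOException(e);
        }
    }

    /** Encode a String to bytes through a Writer created with a named encoding. */
    public static byte[] encode(String text, String encoding) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            Writer w = new OutputStreamWriter(out, encoding);
            w.write(text);
            w.close(); // closing flushes the encoder
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = encode("caf\u00e9", "ISO-8859-1");
        System.out.println(latin1.length); // 4: one byte per char
        System.out.println(decode(latin1, "ISO-8859-1").equals("caf\u00e9")); // true
    }
}
```

With real files you would wrap a FileInputStream or FileOutputStream in the same way.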

However encodings are more versatile than that. They also let you read and write big or little endian 16-bit Unicode character streams. In theory encodings could support complex encoding structures, translation, compression, or quoting. One letter may become many or vice versa. Letters may be suppressed.
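To see the big and little endian forms side by side, here is a small sketch that dumps the bytes produced for the same two characters. The class name EndianDemo and the hex helper are illustrative only; the charset constants come from java.nio.charset.StandardCharsets.

```java
import java.nio.charset.StandardCharsets;

public class EndianDemo {

    /** Render bytes as space-separated upper-case hex pairs. */
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "A\u20AC"; // 'A' followed by the euro sign
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE))); // 00 41 20 AC
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16LE))); // 41 00 AC 20
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16)));   // FE FF 00 41 20 AC
    }
}
```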

Encodings are usually trap door. When you translate to 8-bit you lose information. When you translate it back to Unicode some characters will not come back the same way they were originally. Some may even be missing.

The complete set of supported encodings is not documented, and it varies from one JVM to another. Since Java 1.4, java.nio.charset.Charset.availableCharsets() will list what your JVM supports, though under java.nio canonical names. Beyond that, there are four sources of information:

  1. That lists them, but does not tell you much about them. Note that java.nio uses different canonical names from java.io and java.lang.
  2. A place to look for supported character sets :
  3. Another place to look for supported character Sets :
  4. The following table.
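For the java.nio side at least, a JVM from Java 1.4 onward can be asked at run time which charsets it has. The sketch below prints each canonical name with its aliases; the class name ListCharsets and the canonicalNames helper are made up for illustration, and the names printed are java.nio canonical names, which may differ from the java.io names in the table.

```java
import java.nio.charset.Charset;
import java.util.Map;
import java.util.Set;

public class ListCharsets {

    /** Canonical java.nio names of every charset this JVM reports. */
    public static Set<String> canonicalNames() {
        return Charset.availableCharsets().keySet();
    }

    public static void main(String[] args) {
        // availableCharsets() returns a map sorted by canonical name
        for (Map.Entry<String, Charset> e : Charset.availableCharsets().entrySet()) {
            System.out.println(e.getKey() + "  aliases=" + e.getValue().aliases());
        }
    }
}
```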

Supported Encodings

Here are some encodings typically supported:
Encoding name  Supported?  Description
8859_1 Y Latin-1 ASCII (the USA default). This just takes the low order 8 bits and tacks on a high order 0 byte. Same as ISO8859_1. Microsoft's variant of Latin-1 is called Cp1252.
ASCII Y 7 bit ASCII, plus forms like \uxxxx for the exotic characters.
base64 N base64 source code is available.
base64u N base64u source code is available. A variant of Base64 that is also URL-encoded.
base85 N
Big5 Y Big5, Traditional Chinese
Big5_HKSCS Y Big5 with Hong Kong extensions, Traditional Chinese
Big5_Solaris Y Big5 with seven additional Hanzi ideograph character mappings for the Solaris zh_TW.BIG5 locale
Cp037 Y USA, Canada (Bilingual, French), Netherlands, Portugal, Brazil, Australia, aka Cp1140
Cp273 Y IBM Austria, Germany, aka Cp1141
Cp277 Y IBM Denmark, Norway, aka Cp1142
Cp278 Y IBM Finland, Sweden, aka Cp1143
Cp280 Y IBM Italy, aka Cp1144
Cp284 Y IBM Catalan/Spain, Spanish Latin America, aka Cp1145
Cp285 Y IBM United Kingdom, Ireland, aka Cp1146
Cp297 Y IBM France, aka Cp1147
Cp420 Y IBM Arabic
Cp424 Y IBM Hebrew
Cp437 Y Original IBM PC OEM DOS character set (with line drawing characters and some Greek and math), MS-DOS United States, Australia, New Zealand, South Africa. The rest of the world uses Cp850 for the DOS box.
Cp500 Y EBCDIC 500V1, aka Cp1148
Cp737 Y PC Greek
Cp775 Y PC Baltic
Cp838 Y IBM Thailand extended SBCS
Cp850 Y Microsoft DOS Multilingual Latin-1 (with line drawing characters). For true Latin-1 see 8859_1. See Cp437.
Cp852 Y Microsoft DOS Multilingual Latin-2 Slavic
Cp855 Y IBM Cyrillic
Cp857 Y IBM Turkish
Cp858 Y variant of Cp850 with the Euro. Microsoft DOS Multilingual Latin-1 (with line drawing characters). For true Latin-1 see 8859_1.
Cp860 Y MS-DOS Portuguese
Cp861 Y MS-DOS Icelandic
Cp862 Y PC Hebrew
Cp863 Y MS-DOS Canadian French
Cp864 Y PC Arabic
Cp865 Y MS-DOS Nordic
Cp866 Y MS-DOS Russian
Cp868 Y MS-DOS Pakistan
Cp869 Y IBM Modern Greek
Cp870 Y IBM Multilingual Latin-2
Cp871 Y IBM Iceland, aka Cp1149
Cp874 Y IBM Thai
Cp875 Y IBM Greek
Cp918 Y IBM Pakistan(Urdu)
Cp921 Y IBM Latvia, Lithuania (AIX, DOS)
Cp922 Y IBM Estonia (AIX, DOS)
Cp930 Y Japanese Katakana-Kanji mixed with 4370 UDC, superset of 5026
Cp933 Y Korean Mixed with 1880 UDC, superset of 5029
Cp935 Y Simplified Chinese Host mixed with 1880 UDC, superset of 5031
Cp937 Y Traditional Chinese Host mixed with 6204 UDC, superset of 5033
Cp939 Y Japanese Latin Kanji mixed with 4370 UDC, superset of 5035
Cp942 Y Japanese (OS/2) superset of 932
Cp942C Y variant of Cp942. Japanese (OS/2) superset of Cp932
Cp943 Y Japanese (OS/2) superset of Cp932 and Shift-JIS.
Cp943C Y Variant of Cp943. Japanese (OS/2) superset of Cp932 and Shift-JIS.
Cp948 Y OS/2 Chinese (Taiwan) superset of 938
Cp949 Y PC Korean
Cp949C Y variant of Cp949, PC Korean
Cp950 Y PC Chinese (Hong Kong, Taiwan)
Cp964 Y AIX Chinese (Taiwan)
Cp970 Y AIX Korean
Cp1006 Y IBM AIX Pakistan (Urdu)
Cp1025 Y IBM Multilingual Cyrillic: Bulgaria, Bosnia, Herzegovina, Macedonia(FYR)
Cp1026 Y IBM Latin-5, Turkey
Cp1046 Y IBM Open Edition US EBCDIC
Cp1047 Y IBM System 390 EBCDIC, Java 1.2 only.
Cp1097 Y IBM Iran(Farsi)/Persian
Cp1098 Y IBM Iran(Farsi)/Persian (PC)
Cp1112 Y IBM Latvia, Lithuania
Cp1122 Y IBM Estonia
Cp1123 Y IBM Ukraine
Cp1124 Y IBM AIX Ukraine
Cp1140 Y USA, Canada (Bilingual, French), Netherlands, Portugal, Brazil, Australia
Cp1141 Y IBM Austria, Germany
Cp1142 Y IBM Denmark, Norway
Cp1143 Y IBM Finland, Sweden
Cp1144 Y IBM Italy
Cp1145 Y IBM Catalan/Spain, Spanish Latin America
Cp1146 Y IBM United Kingdom, Ireland
Cp1147 Y IBM France
Cp1148 Y EBCDIC 500V1
Cp1149 Y IBM Iceland
Cp1250 Y Windows Eastern European
Cp1251 Y Windows Cyrillic
Cp1252 Y Microsoft Windows variant of Latin-1, NT default. Beware. Some unexpected translations occur when you read with this default encoding, e.g. codes 128..159 are translated to 16 bit chars with bits in the high order byte on. It does not just truncate the high byte on write and pad with 0 on read. For true Latin-1 see 8859_1.
Cp1253 Y Windows Greek
Cp1254 Y Windows Turkish
Cp1255 Y Windows Hebrew
Cp1256 Y Windows Arabic
Cp1257 Y Windows Baltic
Cp1258 Y Windows Vietnamese
Cp1381 Y IBM OS/2, DOS People's Republic of China (PRC)
Cp1383 Y IBM AIX People's Republic of China (PRC)
Cp33722 Y IBM-eucJP - Japanese (superset of 5050)
Default Y 7-bit ASCII (not the actual default!). Strips off the high order bit 7 and tacks on a high order 0 byte. The actual default is controlled in Windows 95/98/ME/NT/2000/XP by the control panel national settings.
EUC_CN Y GB2312, EUC encoding, Simplified Chinese
EUC_JP Y JIS0201, 0208, 0212, EUC Encoding, Japanese
EUC_JP_LINUX Y JISX0201, 0208, EUC Encoding, Japanese for Linux
EUC_KR Y KS C 5601, EUC Encoding, Korean
EUC_TW Y CNS11643 (Plane 1-3), T. Chinese, EUC encoding
GB18030 Y Simplified Chinese, PRC standard
GB2312 Y Chinese. Popular in email.
GBK Y GBK, Simplified Chinese
IBMOEM N
ISCII91 Y ISCII91 encoding of Indic scripts
ISO2022CN Y ISO 2022 CN, Chinese
ISO2022CN_CNS Y CNS 11643 in ISO-2022-CN form, T. Chinese
ISO2022CN_GB Y GB 2312 in ISO-2022-CN form, S. Chinese
ISO2022JP Y JIS0201, 0208, 0212, ISO2022 Encoding, Japanese
ISO2022KR Y ISO 2022 KR, Korean
ISO8859_1 Y ISO 8859-1, same as 8859_1, USA, Europe, Latin America, Caribbean, Canada, Africa, Latin-1, (Danish, Dutch, English, Faeroese, Finnish, French, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish and Swedish). Beware, for NT, the default is Cp1252 a variant of Latin-1, controlled by the control panel regional settings.
ISO8859_2 Y ISO 8859-2, Eastern Europe, Latin-2, (Albanian, Czech, English, German, Hungarian, Polish, Rumanian, (Serbo-)Croatian, Slovak, Slovene and Swedish)
ISO8859_3 Y ISO 8859-3, SE Europe/miscellaneous, Latin-3 (Afrikaans, Catalan, English, verdastelo Esperanto, French, Galician, German, Italian, Maltese and Turkish)
ISO8859_4 Y ISO 8859-4, Scandinavia/Baltic, Latin-4, (Danish, English, Estonian, Finnish, German, Greenlandic, Lappish, Latvian, Lithuanian, Norwegian and Swedish)
ISO8859_5 Y ISO 8859-5, Cyrillic, (Bulgarian, Bielorussian, English, Macedonian, Russian, Serb(o-Croat)ian and Ukrainian)
ISO8859_6 Y ISO 8859-6, Arabic ASMO 449
ISO8859_7 Y ISO 8859-7, Greek ELOT-928
ISO8859_8 Y ISO 8859-8, Hebrew
ISO8859_9 Y ISO 8859-9, Turkish Latin-5, (English, Finnish, French, German, Irish, Italian, Norwegian, Portuguese, Spanish and Swedish and Turkish)
ISO8859_10 N ISO 8859-10, Lappish/Nordic/Eskimo languages, Latin-6. (Danish, English, Estonian, Faeroese, Finnish, German, Greenlandic, Icelandic, Lappish, Latvian, Lithuanian, Norwegian and Swedish)
ISO8859_11 N ISO 8859-11, Thai.
ISO8859_12 N ISO 8859-12, Devanagari.
ISO8859_13 N ISO 8859-13, Baltic Rim, Latin-7.
ISO8859_14 N ISO 8859-14, Celtic, Latin-8.
ISO8859_15 Y ISO 8859-15, Euro, including Euro currency sign, aka Latin9, not Latin-15 as you would expect. Like Latin-1 with 8 replacements.
JIS Y Japanese
JIS0201 Y JIS 0201, Japanese
JIS0208 Y JIS 0208, Japanese
JIS0212 Y JIS 0212, Japanese
JISAutoDetect Y Detects and converts from Shift-JIS, EUC-JP, ISO 2022 JP (conversion to Unicode only)
Johab Y Johab, Korean
KOI8_R Y KOI8-R, Russian
ks_c_5601-1987 N Korean standard often used in emails. See KSC5601.
KSC5601 Y Korean
Latin-1 N see 8859_1 and Cp1252.
Latin-2 N see 8859_2.
Latin-3 N see 8859_3.
Latin-4 N see 8859_4.
Latin Extended-A N MSWord
Latin Extended-B N MSWord
LocaleDefault N Mad as it sounds, the only way to get this is to look up the Locale default such as

System.getProperty( "file.encoding" );

yourself and pass it explicitly, or use a variant method that does not specify the encoding. The encoding named Default won't do it! In my opinion, all methods that use a LocaleDefault without an encoding parameter should be deprecated.

MacArabic Y Macintosh Arabic
MacCentralEurope Y Macintosh Latin-2
MacCroatian Y Macintosh Croatian
MacCyrillic Y Macintosh Cyrillic
MacDingbat Y Macintosh Dingbat
MacGreek Y Macintosh Greek
MacHebrew Y Macintosh Hebrew
MacIceland Y Macintosh Iceland
MacRoman Y Macintosh Roman
MacRomania Y Macintosh Romania
MacSymbol Y Macintosh Symbol
MacThai Y Macintosh Thai
MacTurkish Y Macintosh Turkish
MacUkraine Y Macintosh Ukraine
MS874 Y Windows Thai
MS932 Y Windows Japanese. Microsoft JIS.
MS936 Y Windows Simplified Chinese PRC
MS949 Y Windows Korean
MS950 Y Windows Traditional Chinese
MS950_HKSCS Y Windows Traditional Chinese with Hong Kong extensions
SingleByte Y This does not expand low order eight-bits with high order zero as its name implies. It looks to be a complex encoding for some Asian language.
SJIS Y Shift JIS. Japanese. A Microsoft code that extends csHalfWidthKatakana to include kanji by adding a second byte when the value of the first byte is in the ranges 81-9F or E0-EF.
TIS620 Y TIS620, Thai
truncation N chop high byte, or 0-pad high byte.
Unicode Y Same as UnicodeBig.
UnicodeBig Y 16-bit UCS-2 Transformation Format, big endian byte order identified by an optional byte-order mark; FE FF. On read, defaults to big-endian. On write puts out a big-endian marker. Same as Unicode and UTF-16.
UnicodeBigUnmarked Y 16-bit UCS-2 Transformation Format, big endian byte order, definitely without Byte Order Mark. Not written on write, ignored on read. Same as UTF-16BE.
UnicodeLittle Y 16-bit UCS-2 Transformation Format, little endian byte order identified by an optional byte-order mark; FF FE. On read, defaults to little-endian. On write puts out a little-endian marker. Like UTF-16LE, but with the marker.
UnicodeLittleUnmarked Y 16-bit UCS-2 Transformation Format, little endian byte order, definitely without Byte Order Mark. Not written on write, ignored on read. Same as UTF-16LE.
URL N For x-www-form-urlencoded use java.net.URLEncoder.encode and java.net.URLDecoder.decode instead. Used to encode CGI command lines. It encodes space as + and special characters as %xx hex. Don't confuse it with BASE64 or BASE64u.
US-ASCII Y 7-bit American Standard Code for Information Interchange.
Uuencode N Similar to base64.
UTF-7 N 7-bit encoded Unicode.
UTF-8 Y 8-bit encoded Unicode. née UTF8. Optional marker on front of file: EF BB BF. DataOutputStream.writeUTF puts a 16-bit binary length count in front of each string. Endianness does not apply to 8-bit encodings.
UTF-16 Y Same as Unicode.
UTF-16BE Y Same as UnicodeBigUnmarked.
UTF-16LE Y Same as UnicodeLittleUnmarked.
UTF-32 N 32-bit UCS-4 Transformation Format, byte order identified by an optional byte-order mark: FF FE 00 00 for little endian, 00 00 FE FF for big endian.
UTF-32BE N 32-bit UCS-4 Transformation Format, big-endian byte order. The 32-bit analogue of UnicodeBigUnmarked.
UTF-32LE N 32-bit UCS-4 Transformation Format, little-endian byte order. The 32-bit analogue of UnicodeLittleUnmarked.
the wrapper N wrapper source code is available. A variant of Base64u that is also URL-encoded. It also optionally handles serialization/reconstituting, compression/decompression, signing/verifying and heavy duty encryption/decryption.
Where two encodings are shown separated by a /, the second one is the new version including the euro symbol. Adam Dingle did the research on how these encodings work.

Many new encodings were added in Java 1.4.1 and some were dropped. This list contains even the dropped items. Beware, this list is not complete. Mainly it is missing all the new Windows and IBM proprietary encodings. Before you use an encoding, make sure it is supported by your version of Java.
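One hedged way to make that check in code, for java.nio names and aliases, is Charset.isSupported. The sketch below returns a fallback when the name is unknown or malformed; the class name PickCharset and the pick helper are invented for this example. It also shows Charset.defaultCharset() (Java 5 and later) as an explicit stand-in for the locale default discussed in the LocaleDefault row.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class PickCharset {

    /** Return the named charset if this JVM supports it, otherwise the fallback. */
    public static Charset pick(String name, Charset fallback) {
        try {
            if (Charset.isSupported(name)) {
                return Charset.forName(name);
            }
        } catch (IllegalArgumentException e) {
            // thrown for a null name or one with illegal characters, e.g. a space
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Which one you get for Cp437 depends on your JVM.
        System.out.println(pick("Cp437", Charset.defaultCharset()));
        System.out.println(pick("no-such-encoding", StandardCharsets.UTF_8)); // UTF-8
    }
}
```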

Note that what Java and the HTML 4.0 specification call a "character encoding" is actually called a "character set" at IANA and in the HTTP proposed standard.

ISO

You can buy documentation on the ISO code sets from the ironically named International Organization for Standardization (ISO). They cost approximately 50.00 CHF each.

Roll Your Own

If you don't see the character set encoding you need, you can write your own translate/encoding tables and insert them as part of the official set. See the java.nio.charset.spi.CharsetProvider, Charset, CharsetEncoder and CharsetDecoder classes.
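As a rough sketch of what those classes ask of you, here is a toy charset that treats text as 7-bit ASCII with the letters rotated 13 places. Everything about it is hypothetical: the class name Rot13Charset, the charset name X-ROT13-DEMO and the ROT13 mapping are chosen only to keep the example short. Registering it so that Charset.forName can find it would additionally need a CharsetProvider subclass named in a META-INF/services/java.nio.charset.spi.CharsetProvider file, which is not shown.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

/** Hypothetical toy charset: 7-bit ASCII with letters rotated 13 places. */
public class Rot13Charset extends Charset {

    public Rot13Charset() {
        super("X-ROT13-DEMO", null); // made-up name, no aliases
    }

    private static char rot13(char c) {
        if (c >= 'a' && c <= 'z') return (char) ('a' + (c - 'a' + 13) % 26);
        if (c >= 'A' && c <= 'Z') return (char) ('A' + (c - 'A' + 13) % 26);
        return c;
    }

    @Override
    public boolean contains(Charset cs) {
        return cs.name().equals(name());
    }

    @Override
    public CharsetDecoder newDecoder() {
        // one char per byte, on average and at most
        return new CharsetDecoder(this, 1.0f, 1.0f) {
            @Override
            protected CoderResult decodeLoop(ByteBuffer in, CharBuffer out) {
                while (in.hasRemaining()) {
                    if (!out.hasRemaining()) {
                        return CoderResult.OVERFLOW;
                    }
                    byte b = in.get();
                    if (b < 0) { // high bit set: not 7-bit
                        in.position(in.position() - 1); // un-read the bad byte
                        return CoderResult.malformedForLength(1);
                    }
                    out.put(rot13((char) b));
                }
                return CoderResult.UNDERFLOW;
            }
        };
    }

    @Override
    public CharsetEncoder newEncoder() {
        // one byte per char, on average and at most
        return new CharsetEncoder(this, 1.0f, 1.0f) {
            @Override
            protected CoderResult encodeLoop(CharBuffer in, ByteBuffer out) {
                while (in.hasRemaining()) {
                    if (!out.hasRemaining()) {
                        return CoderResult.OVERFLOW;
                    }
                    char c = in.get();
                    if (c > 127) { // no 7-bit code for this char
                        in.position(in.position() - 1); // un-read it
                        return CoderResult.unmappableForLength(1);
                    }
                    out.put((byte) rot13(c));
                }
                return CoderResult.UNDERFLOW;
            }
        };
    }

    public static void main(String[] args) {
        Charset cs = new Rot13Charset();
        byte[] encoded = "Hello".getBytes(cs);
        System.out.println(new String(encoded, StandardCharsets.US_ASCII)); // Uryyb
        System.out.println(new String(encoded, cs)); // Hello
    }
}
```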

native2ascii

Sun includes a misnamed utility, native2ascii.exe, with the JDK. It converts files from any encoding to 8-bit printable form, and back. The printable form uses ASCII characters plus escapes like \u95e8 for the exotic characters.

Here is how you would take a file in the old DOS IBMOEM encoding and bring it up to UTF-8 snuff for posting on the web.



Here is how you would take an UTF-8 file and convert it back to native format, e.g. Windows NT.



You can have a look at a registry dump this way:

regedit /E java.reg "HKEY_LOCAL_MACHINE\SOFTWARE\JavaSoft"
native2ascii -encoding UnicodeLittle java.reg java.asc

You can convert between any two encodings by going in two steps, via printable. Someone should probably write an improved version of this little utility that can convert from anything to anything in one step and that will put the output back on top of the input file by default. You would do this by creating a temporary output file in the same directory as the input file, then renaming it when safely done.
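A one-step converter along those lines might look like the sketch below, written against java.nio.file (Java 7 and later). The class name Transcode and its methods are invented for illustration. It reads with one charset, writes a temporary file in the same directory with the other, and only then moves the temporary file over the original.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class Transcode {

    /** Rewrite file in place, converting its text from one charset to another. */
    public static void inPlace(Path file, Charset from, Charset to) throws IOException {
        Path dir = file.toAbsolutePath().getParent();
        Path tmp = Files.createTempFile(dir, "transcode", ".tmp");
        try (Reader in = Files.newBufferedReader(file, from);
             Writer out = Files.newBufferedWriter(tmp, to)) {
            char[] buf = new char[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            Files.deleteIfExists(tmp); // leave the original untouched on failure
            throw e;
        }
        // the original is replaced only after the copy finished cleanly
        Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
    }

    /** Demo: write "café" as Latin-1, convert to UTF-8 in place, return the new bytes. */
    public static byte[] demo() {
        try {
            Path f = Files.createTempFile("demo", ".txt");
            Files.write(f, new byte[] {'c', 'a', 'f', (byte) 0xE9});
            inPlace(f, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
            byte[] result = Files.readAllBytes(f);
            Files.delete(f);
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo().length); // 5: the e-acute now takes two bytes
    }
}
```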

Reversibility

You won't necessarily get exactly back to where you started if you encode then decode. If you choose a traditional single-byte, 8-bit encoding, say Cp437, as your target, there are only 256 codes to go round for all 64K Unicode characters. Obviously, some Unicode characters are going to have to collapse onto the same 8-bit character, and so won't decode back to where they started. Further, some of these 8-bit encodings have a few strange characters that don't exist in Unicode. UTF-8 does not suffer from this problem.
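A quick way to test whether a particular string survives the trip is to encode it, decode it, and compare. This sketch uses ISO-8859-1 rather than Cp437 only because every JVM is required to have it; the class name RoundTrip and the survives helper are illustrative.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RoundTrip {

    /** True if text comes back unchanged after encoding and decoding with cs. */
    public static boolean survives(String text, Charset cs) {
        return new String(text.getBytes(cs), cs).equals(text);
    }

    public static void main(String[] args) {
        String s = "caf\u00e9 \u20ac5"; // e-acute is in Latin-1, the euro sign is not
        System.out.println(survives(s, StandardCharsets.ISO_8859_1)); // false: euro becomes ?
        System.out.println(survives(s, StandardCharsets.UTF_8));      // true
    }
}
```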

Further, the encode/decode routines are permitted to combine pairs such as 0x0055 (LATIN CAPITAL LETTER U) followed by 0x0308 (COMBINING DIAERESIS) to a single character 0x00DC (LATIN CAPITAL LETTER U WITH DIAERESIS), or vice versa.
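The same pair can be combined or split explicitly with java.text.Normalizer (Java 6 and later). This is a separate mechanism from the encoders and decoders, shown here only to illustrate that the two spellings are distinct strings until normalized; the class name ComposeDemo is made up.

```java
import java.text.Normalizer;

public class ComposeDemo {

    /** Combine base letters and combining marks into single characters where possible (NFC). */
    public static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Split precomposed characters into base letter plus combining marks (NFD). */
    public static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String pair = "U\u0308";   // U followed by COMBINING DIAERESIS
        String single = "\u00DC";  // LATIN CAPITAL LETTER U WITH DIAERESIS
        System.out.println(pair.equals(single));            // false
        System.out.println(compose(pair).equals(single));   // true
        System.out.println(decompose(single).equals(pair)); // true
    }
}
```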

Tracking Which Characters Get Translated Where

Be careful when translating between character sets using the encoding feature of Readers. Everything goes through the intermediate 16-bit Unicode, which may not have all the characters of the source and destination character sets. Some characters may be translated to codes with some high byte bits on. For more accurate translation, do it yourself with a one-step table. You can use the following program to discover what translations are being done with any particular encoding, and use that information to generate the source for your own translate table, using the automatic encodings, so that you can see any inaccuracies and fix them.

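/**
 * Detect chars that are translated by an encoding.
 * by Jon Skeet
 * Pass the name of the encoding you want to test on the command line,
 * e.g. 8859_1 ISO8859_4 Cp437
 */
public class EncodingTest
   {
   public static void main( String[] args ) throws Exception
      {
      if ( args.length != 1 )
         {
         System.err.println( "usage: java EncodingTest encodingName" );
         return;
         }
      String encoding = args[0];
      // one byte for every possible 8-bit value, 0..255
      byte[] b = new byte[256];
      for ( int i=0; i<256; i++ )
         {
         b[i] = (byte)i;
         }
      // decode all 256 bytes with the encoding under test
      String x = new String( b, encoding );
      // Report every position whose char differs from the byte value.
      // For multi-byte encodings x may be shorter than 256 chars,
      // so positions no longer line up with the input bytes.
      for ( int i=0; i<x.length(); i++ )
         {
         if ( x.charAt( i ) != i )
            {
            System.out.println( i + " -> " + (int)x.charAt( i ) );
            }
         }
      }
   }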

Encoding Identification

Files are not marked with a signature to denote the encoding used. Further, the encoding is not recorded externally in some sort of resource fork. You are just supposed to know what sort of encoding was used or track it by some ad hoc means. There are two exceptions.
  1. HTML files have the optional content-type meta tag to tell you.
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
    
  2. XML files have the optional encoding parameter.
    <?xml version="1.0" encoding="ISO-8859-1" ?>
You can make a guess by reading the text. The language gives a clue to the likely encoding used. The way common words are encoded gives a clue. Try looking at the document in various encodings and see which makes the most sense.

The Unicode little-endian or big-endian BOM (Byte Order Mark) is a strong clue you have 16-bit Unicode.
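Checking for a byte order mark takes only a few comparisons on the first bytes of the file. The sketch below looks for the UTF-8 and 16-bit marks listed in the table above; the class name BomSniffer and the sniff helper are illustrative, and it makes no attempt to tell a little-endian 16-bit mark from the 32-bit mark FF FE 00 00.

```java
public class BomSniffer {

    /** Return a charset name suggested by a leading byte order mark, or null if none. */
    public static String sniff(byte[] b) {
        if (b.length >= 3
                && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return "UTF-16LE"; // could also be the start of a UTF-32LE mark
        }
        return null; // no mark: the encoding must be found some other way
    }

    public static void main(String[] args) {
        System.out.println(sniff(new byte[] {(byte) 0xFE, (byte) 0xFF, 0x00, 0x41})); // UTF-16BE
        System.out.println(sniff(new byte[] {0x41, 0x42})); // null
    }
}
```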

To automate the guessing, you could look for common foreign words to see how they are encoded. You could compute letter frequencies and compare them against documents with known encodings.
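The common-word idea could be sketched roughly as follows: decode the bytes with each candidate encoding and count how many probe words turn up. The class name GuessByWords, the probe words and the candidate list are all assumptions picked for the demo; a real guesser would need far larger word lists, and ties here simply go to the first candidate.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GuessByWords {

    /** Count how many probe words appear when data is decoded with cs. */
    public static int score(byte[] data, Charset cs, String[] probes) {
        String text = new String(data, cs);
        int hits = 0;
        for (String word : probes) {
            if (text.contains(word)) {
                hits++;
            }
        }
        return hits;
    }

    /** Return the candidate with the most probe-word hits (first one wins a tie). */
    public static Charset guess(byte[] data, Charset[] candidates, String[] probes) {
        Charset best = null;
        int bestScore = -1;
        for (Charset cs : candidates) {
            int s = score(data, cs, probes);
            if (s > bestScore) {
                bestScore = s;
                best = cs;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // German sample text, secretly encoded as Latin-1
        byte[] data = "f\u00fcr \u00fcber".getBytes(StandardCharsets.ISO_8859_1);
        Charset[] candidates = {StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1};
        String[] probes = {"f\u00fcr", "\u00fcber"};
        System.out.println(guess(data, candidates, probes)); // ISO-8859-1
    }
}
```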

You might want to tackle this student project to solve the problem.

Learning More

armouring ¤ codepages ¤ Czyborra code charts ¤ IANA list of character set names ¤ IANA: (Internet Assigned Numbers Authority) ¤ ISO ¤ Ken Whistler's glossary to learn the vocabulary of character sets ¤ student project on encoding identification ¤ Unicode Byte Order Marks ¤ unicode


You can get a fresh copy of this page from: http://mindprod.com/jgloss/encoding.html