Java Glossary : Unicode

Last updated 2004-06-28 by Roedy Green ©1996-2004 Canadian Mind Products


Unicode
The character encoding used in Java. A Java char holds one 16-bit Unicode value, which covers every character from \u0000 to \uffff; the characters numbered 10000 (hex) and above in the glyph table below do not fit in 16 bits and are represented by a pair of chars, a high surrogate followed by a low surrogate. Unicode is sometimes called UCS or ISO 10646, the ISO standard that defines the same character set. The glyph charts are now in PDF format (Adobe Acrobat is required to view them) and come with cross-references between similar characters.

Unicode allows Java to handle international characters for most of the world's living languages, including Arabic, Armenian, Bengali, Bopomofo, Chinese (via unified Han), Cyrillic, English, Georgian, Greek, Gujarati, Gurmukhi, Hebrew, Hindi (Devanagari), Japanese (Hiragana and Katakana, plus Kanji via unified Han), Kannada, Korean (Hangul, plus Hanja via unified Han), Lao, Malayalam, Oriya, Tamil, Telugu, Thai, Tibetan... Unicode will make it much easier for non-English speaking programmers to write programs for English speaking users and vice versa.

There are even codes for apple = '\uf000' (a Private Use Area code, so what it displays depends on the font), British pound sign £ = '\u00A3', degree ° = '\u00b0', checkmark = '\u2713', dharma wheel = '\u2638', division = '\u00f7', euro = '\u20AC', female = '\u2640', heart = '\u2665', infinity = '\u221E', integral = '\u222B', male = '\u2642', pi = '\u03C0', capital PI = '\u03A0', sun = '\u2600', telephone = '\u260E' and trademark TM = '\u2122'.

There are also arrows, all in the Arrows block that starts at 2190: \u2190 \u2191 \u2192 \u2193 \u2194 \u2195 \u21A2 \u21AC \u21AD \u21B0 \u21B6 \u21C5 \u21CE \u21DC

In addition there are all kinds of interesting special characters, such as: Alphabetic Presentation Forms, APL, Arrows, Bengali, Block Elements, Box Drawing, Braille Patterns, Byzantine Musical Symbols, Combining Diacritical Marks, Combining Half Marks, Combining Marks for Symbols, Control Pictures (icons for control chars), Currency Symbols, Dingbats, Enclosed Alphanumerics, General Punctuation, Geometric Shapes, Halfwidth and Fullwidth Forms, High Surrogates, Ideographic Description Characters, IPA Extensions, Letterlike Symbols, Low Surrogates, Mathematical Alphanumeric Symbols (32-bit Unicode), Mathematical Operators, Mathematical Symbols, Miscellaneous Symbols (astrology, chess, playing cards), Miscellaneous Technical (del, grad, integral), Musical Symbols, Number Forms (e.g. Roman numerals), OCR (Optical Character Recognition, the OCR-A and MICR characters used in magnetic ink cheque encoding), Old Italic, Runic, Small Form Variants, Spacing Modifier Letters, Specials, Superscripts and Subscripts, Tags (invisible language-tag characters), Unified Canadian Aboriginal Syllabics and Variation Selectors.

You can also download a Word for Windows document giving the full Unicode character set with Java, HTML and PostScript encodings. It is not as pretty as the PDF, but it is easier to search.

Nic Fulton of Reuters has written a Java test applet that can display all 64 thousand 16-bit Unicode characters, including the Chinese/Korean Han. How many of them actually display on your screen depends on the font-handling ability of your browser and operating system, and on which fonts you have installed. In Java programs, Unicode characters that you cannot type directly are written in the form '\uffff', a backslash and u followed by exactly four hex digits. Ordinary characters like 'A' are 16-bit Unicode too: 'A' is the same char as '\u0041'.
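To make the escape notation concrete, here is a minimal sketch, assuming Java 5 or later. The class name UnicodeEscapeDemo and the helper toEscape are made up for illustration. It shows that an escaped char and a typed char are the same thing, and that a character beyond the 16-bit range (the musical G clef, hex 1D11E, is used as an example) occupies two chars.

```java
// Hypothetical demo class; not part of any library.
class UnicodeEscapeDemo {

    // Formats one 16-bit char as a backslash, the letter u and four hex digits.
    static String toEscape(char c) {
        String hex = Integer.toHexString(c);
        StringBuilder sb = new StringBuilder("\\u");
        for (int i = hex.length(); i < 4; i++) {
            sb.append('0'); // left-pad to exactly four digits
        }
        return sb.append(hex).toString();
    }

    public static void main(String[] args) {
        char letterA = '\u0041';
        char euro = '\u20AC';

        // An escape is just another spelling of the same char.
        System.out.println(letterA == 'A');
        System.out.println(toEscape(euro));

        // A character above the 16-bit range needs two chars (a surrogate pair).
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println("chars: " + clef.length());
        System.out.println("code points: " + clef.codePointCount(0, clef.length()));
        for (int i = 0; i < clef.length(); i++) {
            System.out.println(toEscape(clef.charAt(i)));
        }
    }
}
```

Whether the euro sign or the clef shows up correctly on the console depends on the console encoding and fonts, which is why the sketch prints escapes rather than the characters themselves.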

How do you create and edit the various flavours of Unicode documents? You can create them in some specific encoding and then convert them. To write a little utility to do that, read up on encoding and ask the File I/O Amanuensis for sample code. You can use lowly Notepad in Windows NT/W2K/XP, but not in earlier Windows versions, to edit existing documents; it is even clever enough to deal with byte order (endian) marks. To start a new document you would have to acquire an almost empty Unicode document first. Recent versions of MS Word in Windows NT/W2K/XP also work.
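A conversion utility of the kind described above might look like the following. This is only a sketch, not the File I/O Amanuensis output: the class name Transcoder and its methods are invented for illustration, it assumes Java 7 or later for java.nio.file, and it reads the whole file into memory, so it suits small documents only.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical utility class; the names are illustrative only.
class Transcoder {

    // Decodes the bytes using one encoding, then re-encodes the text in another.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        String text = new String(input, from);
        return text.getBytes(to);
    }

    // Converts a whole file; both encodings must be stated by the caller.
    static void transcodeFile(Path source, Charset from, Path target, Charset to)
            throws IOException {
        Files.write(target, transcode(Files.readAllBytes(source), from, to));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 4) {
            System.err.println("usage: Transcoder infile inEncoding outfile outEncoding");
            return;
        }
        transcodeFile(Paths.get(args[0]), Charset.forName(args[1]),
                      Paths.get(args[2]), Charset.forName(args[3]));
    }
}
```

For example, `java Transcoder in.txt ISO-8859-1 out.txt UTF-16LE` would rewrite a Latin-1 file as little-endian UTF-16. Characters that the target encoding cannot represent are silently replaced by the encoder's substitute character, so check the result when converting to a smaller character set.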

Byte Order Marks

You can recognise Unicode files by their starting byte order marks and, for UTF-16 files of mostly Latin text, by the way roughly every second byte is zero.
Unicode Endian Markers
Byte-order mark  Description
EF BB BF         UTF-8
FF FE            UTF-16 aka UCS-2, little-endian
FE FF            UTF-16 aka UCS-2, big-endian
FF FE 00 00      UTF-32 aka UCS-4, little-endian
00 00 FE FF      UTF-32 aka UCS-4, big-endian
There are also variants of these encodings, such as UTF-16BE and UTF-16LE, that carry no mark in the file; the byte order is implied by the encoding name.
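Recognising a file by its byte order mark can be sketched as below. The class name BomSniffer and its methods are hypothetical. The sketch assumes the UTF-32 little-endian mark is FF FE 00 00, the byte-reversed form of the big-endian 00 00 FE FF, and it tests the four-byte marks before the two-byte ones because FF FE is a prefix of FF FE 00 00.

```java
// Hypothetical helper class; the names are illustrative only.
class BomSniffer {

    // Returns a Java encoding name guessed from the leading bytes,
    // or null when no byte order mark is present.
    static String detect(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        // Check the 4-byte marks first: FF FE also begins the UTF-32LE mark.
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        return null;
    }

    // True when data begins with the given unsigned byte values.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x41};
        System.out.println(detect(sample));
    }
}
```

A mark is only a hint. A little-endian UTF-16 file whose first character is NUL also starts with FF FE 00 00, and a file with no mark at all gives the sketch nothing to go on, so the caller still needs a default encoding.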

Glyphs

Unicode 16 and Unicode 32 Glyphs
in Downloadable Acrobat PDF Format
Code  Description                    Code    Description
0000  Basic Latin                    25A0    Geometric Shapes
0080  Latin-1 Supplement             2600    Miscellaneous Symbols
0100  Latin Extended-A               2700    Dingbats
0180  Latin Extended-B               27C0    Miscellaneous Mathematical Symbols-A
0250  IPA Extensions                 27F0    Supplemental Arrows-A
02B0  Spacing Modifier Letters       2800    Braille Patterns
0300  Combining Diacritical Marks    2900    Supplemental Arrows-B
0370  Greek                          2980    Miscellaneous Mathematical Symbols-B
0400  Cyrillic                       2A00    Supplemental Mathematical Operators
0500  Cyrillic Supplement            2B00    Miscellaneous Symbols and Arrows
0530  Armenian                       2E80    CJK Radicals Supplement
0590  Hebrew                         2F00    Kangxi Radicals
0600  Arabic                         2FF0    Ideographic Description Characters
0700  Syriac                         3000    CJK Symbols and Punctuation
0780  Thaana                         3040    Hiragana
0900  Devanagari                     30A0    Katakana
0980  Bengali                        3100    Bopomofo
0A00  Gurmukhi                       3130    Hangul Compatibility Jamo
0A80  Gujarati                       3190    Kanbun
0B00  Oriya                          31A0    Bopomofo Extended
0B80  Tamil                          31F0    Katakana Phonetic Extensions
0C00  Telugu                         3200    Enclosed CJK Letters and Months
0C80  Kannada                        3300    CJK Compatibility
0D00  Malayalam                      3400    CJK Unified Ideographs Extension A
0D80  Sinhala                        4DC0    Yijing Hexagram Symbols
0E00  Thai                           4E00    CJK Unified Ideographs
0E80  Lao                            A000    Yi Syllables
0F00  Tibetan                        A490    Yi Radicals
1000  Myanmar                        AC00    Hangul Syllables
10A0  Georgian                       D800    High Surrogates
1100  Hangul Jamo                    DC00    Low Surrogates
1200  Ethiopic                       E000    Private Use Area
13A0  Cherokee                       F900    CJK Compatibility Ideographs
1400  Canadian Aboriginal Syllabic   FB00    Alphabetic Presentation Forms
1680  Ogham                          FB50    Arabic Presentation Forms-A
16A0  Runic                          FE00    Variation Selectors
1700  Tagalog                        FE20    Combining Half Marks
1720  Hanunoo                        FE30    CJK Compatibility Forms
1740  Buhid                          FE50    Small Form Variants
1760  Tagbanwa                       FE70    Arabic Presentation Forms-B
1780  Khmer                          FF00    Halfwidth and Fullwidth Forms
1800  Mongolian                      FFF0    Specials
1900  Limbu                          10000   Linear B Syllabary
1950  Tai Le                         10080   Linear B Ideograms
19E0  Khmer Symbols                  10100   Aegean Numbers
1D00  Phonetic Extensions            10300   Old Italic
1E00  Latin Extended Additional      10330   Gothic
1F00  Greek Extended                 10380   Ugaritic
2000  General Punctuation            10400   Deseret
2070  Superscripts and Subscripts    10450   Shavian
20A0  Currency Symbols               10480   Osmanya
20D0  Combining Marks for Symbols    10800   Cypriot Syllabary
2100  Letterlike Symbols             1D000   Byzantine Musical Symbols
2150  Number Forms                   1D100   Musical Symbols
2190  Arrows                         1D300   Tai Xuan Jing Symbols
2200  Mathematical Operators         1D400   Mathematical Alphanumeric Symbols
2300  Miscellaneous Technical        20000   CJK Unified Ideographs Extension B
2400  Control Pictures               2F800   CJK Compatibility Ideographs Supp.
2440  Optical Character Recognition  E0000   Tags
2460  Enclosed Alphanumerics         E0100   Variation Selectors Supp.
2500  Box Drawing                    F0000   Supplementary Private Use Area-A
2580  Block Elements                 100000  Supplementary Private Use Area-B
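
Java can tell you which of the blocks listed above a character belongs to. The following is a minimal sketch assuming Java 5 or later, where Character.UnicodeBlock.of accepts an int code point; the class name BlockLookup and the sample code points are chosen only for illustration. Which blocks a given Java release knows about depends on the Unicode version it supports.

```java
// Hypothetical demo class; not part of any library.
class BlockLookup {

    // Looks up the block for a code point; returns null when the
    // code point is not a member of any block known to this Java release.
    static Character.UnicodeBlock blockOf(int codePoint) {
        return Character.UnicodeBlock.of(codePoint);
    }

    public static void main(String[] args) {
        int[] samples = {0x0041, 0x20AC, 0x2190, 0x2713, 0x1D11E};
        for (int cp : samples) {
            // Print the code point in hex next to the block's name.
            System.out.println(String.format("U+%04X %s", cp, blockOf(cp)));
        }
    }
}
```

The lookup takes an int rather than a char so that characters above the 16-bit range, such as the last sample, can be passed directly instead of as a surrogate pair.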


Please send errors, omissions and suggestions
to improve this page to Roedy Green.
You can get a fresh copy of this page from http://mindprod.com/jgloss/unicode.html or possibly from your local J: drive mirror at J:\mindprod\jgloss\unicode.html