The primary function of XML is to consume RAM and data communication
bandwidth. Presumably it was promoted to its current frenzy by companies who
sell either RAM or bandwidth. Others promoting it have patents they hope to
spring on the public once it is entrenched. XML is the biggest con game going in
computers. As you have probably guessed, I am known for my rabid dislike of XML.
XML is the Extensible Markup Language, a W3C
Recommendation. Like HTML, XML is based on SGML, an International Standard (ISO
8879) for creating markup languages. However, while HTML is a single SGML
document type, with a fixed set of element type names (AKA "tag names"),
XML is a simplified profile of SGML: you can use it to define many different
document types, each of which uses its own element type names (instead of HTML's
"html", "body", "h1", "ol", etc.). For example, in
XML, you can mark up an online transaction like this:
<?xml version="1.0" standalone="no"?>
<!DOCTYPE order SYSTEM "order.dtd">
<order>
<invoice-number>12345</invoice-number>
<customer>Wile E. Coyote</customer>
<date>1997-12-14</date>
<item>
<name>Jet-Propelled Roller Skates</name>
<catalog-number>345-678-9</catalog-number>
<quantity>2</quantity>
</item>
<item>
<name>100,000-pound Weight</name>
<catalog-number>987-654-3</catalog-number>
<quantity>1</quantity>
</item>
</order>
Just like HTML, comments begin with <!-- and end with -->.
You can abbreviate <mytag></mytag> as <mytag/>.
Just like HTML, various characters are reserved and have long
forms called entities to use when they occur
incidentally in the text as data: &amp;, &lt;,
&gt;, &apos; and &quot;.
Character references take one of two forms: decimal references, &#8478;
and hexadecimal references, &#x211E;. Unicode is
always presumed.
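To see the entities and character references in action, here is a minimal sketch using Python's standard library parser. The element names rx and t are made up for illustration; they are not part of any real document type.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Decimal and hexadecimal references to the same character, U+211E.
text = ET.fromstring("<rx>&#8478; &#x211E;</rx>").text
print(text == "\u211e \u211e")

# The five predefined entities decode back to the reserved characters.
reserved = ET.fromstring("<t>&amp; &lt; &gt; &apos; &quot;</t>").text
print(reserved)

# Going the other way: escape() encodes &, < and > for use as element text.
encoded = escape("Fish & Chips < $5")
print(encoded)
```

Both forms of reference come out as the same single character, which is all the parser hands to your program.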
You describe your little XML subgrammar by writing a DTD (Document Type
Definition) file. Optionally, you can include the DTD inline inside your
XML file.
There are two popular parsing techniques: SAX
(Simple API for XML), which hands you each field as it
parses, and the W3C DOM (Document Object Model), which
creates a complete parse tree you can prune and repeatedly scan.
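The difference between the two styles is easier to see side by side. This is only a sketch, using Python's standard xml.sax and xml.dom.minidom modules on a cut-down version of the order above. The QuantityHandler class is my own illustration, not part of either API.

```python
import xml.sax
from xml.dom.minidom import parseString

ORDER = """<order>
<invoice-number>12345</invoice-number>
<item><name>Jet-Propelled Roller Skates</name><quantity>2</quantity></item>
<item><name>100,000-pound Weight</name><quantity>1</quantity></item>
</order>"""


class QuantityHandler(xml.sax.ContentHandler):
    """SAX style: react to events as the parser streams through the text."""

    def __init__(self):
        super().__init__()
        self.buffer = None  # collects character data while inside <quantity>
        self.total = 0

    def startElement(self, name, attrs):
        if name == "quantity":
            self.buffer = []

    def characters(self, content):
        # Character data may arrive in several chunks, so accumulate it.
        if self.buffer is not None:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "quantity":
            self.total += int("".join(self.buffer))
            self.buffer = None


handler = QuantityHandler()
xml.sax.parseString(ORDER.encode("utf-8"), handler)
print("SAX total quantity:", handler.total)

# DOM style: build the whole tree first, then query it as often as you like.
dom = parseString(ORDER)
names = [node.firstChild.data for node in dom.getElementsByTagName("name")]
print("DOM item names:", names)
```

SAX never holds more than the current field, while DOM keeps the entire tree in RAM for the life of the document object.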
I personally detest XML; however, it has caught on like a cocaine wave. It must
have some redeeming features.
XML Benefits
-
It is relatively easy to whip up a DTD to describe an XML grammar for some
little data file. That DTD is all you need to generate a parser.
-
The XML files can be viewed or composed by humans using a text editor.
-
XML is about as simple a grammar as you can get.
-
XML can work with almost any 8-bit or 16-bit character set.
-
XML is good at handling hierarchical data.
-
You can have Pick OS-like data, with arbitrarily long fields, and arbitrarily
repeated fields.
-
XML is platform independent. It has no big-little endian problems.
-
It is possible to parse XML without writing a DTD. This process presumes the
grammar is perfect.
-
XML search engines can take into account the tag context, e.g. "Washington"
inside tag <state>, <president>, <mountain>, <moviestar>.
An XML search engine can show you which tags the word was found in and let you choose the
relevant ones.
-
XML settles on Unicode character encoding to allow transmitting data in any
language, though it does require clumsy entity encoding/decoding.
-
A program does not need to understand the entire structure of a file. It can
just pick out the tags of interest. This means new tags can be easily added
without disturbing existing software that uses the file.
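The last benefit in the list above can be sketched in a few lines. The priority and gift-wrap tags are hypothetical additions I invented to stand in for "new tags"; customer_of is likewise just an illustrative helper.

```python
import xml.etree.ElementTree as ET


def customer_of(order_xml):
    """Pick out the one tag of interest and ignore everything else."""
    return ET.fromstring(order_xml).findtext("customer")


old_format = "<order><customer>Wile E. Coyote</customer></order>"
# A later revision of the file adds tags this code has never heard of.
new_format = ("<order><priority>rush</priority>"
              "<customer>Wile E. Coyote</customer><gift-wrap/></order>")

print(customer_of(old_format))
print(customer_of(new_format))
```

The same function works on both versions of the file because it never looks at the tags it does not care about.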
XML Drawbacks
-
XML is incredibly fluffy and repetitive. It wastes bandwidth in transmission.
You must compress it. Happily, ZIP-style compression works very well on
XML. Unfortunately, you have to fluff it back up to process it, wasting RAM with
unprecedented abandon. In practice no one does compress it.
-
It takes up huge amounts of RAM and disk space to store it.
-
The DOM parse tree considers every space significant, even spaces between
tags, even spaces for indenting, even trailing spaces on a line, even double
spaces embedded in data.
-
There is no mechanism to describe the types of the data. To XML, everything is a
string. There is no way to specify a field must be numeric, that it needs two
decimal places, that it must represent a date in some range, that it must not
have accented letters, that it be restricted to certain punctuation, or be one
of a certain set of legal values. There are scores of tack-ons trying to fix
this and other shortcomings turning the simple XML into a tower of Babel.
-
You can't use the XML files directly; they need to be parsed first. Perhaps some
day there will be pre-parsed, compact, computer-friendly versions of XML. I have
heard a rumour that such a beast, called XMLC, has been proposed.
-
It uses HTML's fluffy system of entities such as &lt; and &amp;.
-
There are a raft of recommendations surrounding XML, such as XPath, XPointer,
XSL, CSS, XLink and so forth. In the pipeline are XHTML, Metadata and Namespaces
and a Schema system. XML is fast becoming very complicated, because it is not
really standalone. You need added extras to make it usable. Competing standards
will have to fight it out. The #1 reason XML caught on was its raging-idiot
simplicity. Now it has not even that advantage.
-
XML advocates say "Memory is cheap and bandwidth is cheap, so what the hell,
let's squander it." However, this is not true with handhelds. Memory
consumes battery power, the main limit today of handheld capabilities. Bandwidth
consumes radio air time and battery time. We are running out of broadcast
frequencies. You can't manufacture more of them once the channels are filled,
just use them more efficiently. Further, the delays caused by bloated XML
packets consume precious people time, and frustrate the heck out of users
completely needlessly.
-
In an Applet or a handheld device, memory for data and code is at a premium.
You normally carefully massage the data offline to be as predigested and as
compact as possible, e.g. serialised objects. As well as being fat, XML needs
considerable processing before it can be used. This consumes RAM for both data
and code, and battery power to do the massaging.
-
There is no standard way to compress XML. You can use ZIP, which is very CPU and
RAM heavy. You can use WBXML (Wireless Binary XML). The
problem is that on receipt it is fluffed back up to regular XML and then parsed, so it
has even more parsing overhead than regular XML. There are other compressed
formats, ASN.1 and WML. In practice most XML gets sent in its outrageously fluffy
default form. People think XML files are always tiny little 1K configuration
files and so why worry. The point is once a format gets established, it gets
used for all sorts of things the originators would never have dreamed of, like 3
gig image files.
-
XML uses an ugly syntax with gratuitous punctuation. #IMPLIED
really means optional. #PCDATA
means string. <!ATTLIST
means attributes.
-
There are no standard tag names for XML. Everyone still codes postal addresses
differently which means data exchange still requires custom coding. RDF
ontologies address this problem.
-
Book: The Theory Of The Leisure Class
ISBN: 0-14-018795-2
Author: Thorstein Veblen
Published in 1899. Highly amusing. He coined the terms conspicuous consumption and conspicuous waste to explain modern status displays.
XML is an example of conspicuous waste, waste for
waste's sake. I find it morally repugnant. It reminds me of Roman Emperor
Caligula who took a bite of a peach, tossed it away, then grabbed a fresh one.
The authors went out of their way to create a bloated, ugly syntax.
Using XML to transmit data is the analog of insisting that all code be passed
around as triple spaced Java source files, with added dummy comments, rather
than as binary byte code. There is no guarantee a source file is even
syntactically correct. It is impossible to create a syntactically incorrect byte
code file. Byte code files can be processed without time-consuming parsing. In
byte code, repeating strings are naturally specified only once. XML, as it
stands, suffers from all those analogous drawbacks and more.
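The whitespace complaint in the list above is easy to demonstrate. Here is a small sketch with Python's xml.dom.minidom. The child_summary helper is my own; the two documents differ only in indentation.

```python
from xml.dom.minidom import parseString

pretty = "<order>\n  <customer>Wile E. Coyote</customer>\n</order>"
compact = "<order><customer>Wile E. Coyote</customer></order>"


def child_summary(xml_text):
    """List the (nodeName, nodeValue) of each direct child of the root."""
    root = parseString(xml_text).documentElement
    return [(child.nodeName, child.nodeValue) for child in root.childNodes]


# The indented version grows two extra text nodes holding only whitespace.
print(child_summary(pretty))
print(child_summary(compact))
```

Every newline and indent used for human readability becomes a node in the tree that your code must then step around.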
What Should Replace XML?
A replacement should have these characteristics:
-
It needs to be a binary format for compactness. Files have to both be
transmitted and stored. Size does matter. People think in terms of one page XML
files, but they potentially could be gigabytes long. If XML becomes an
established interchange format we will pay for the slop in XML trillions of
times over. It is not good enough to say XML files will always be stored in
compressed form. In my experience in practice XML files are never compressed.
Files should be both compact and quick to process. XML as it stands is neither.
-
It needs to be a binary format to ensure correctness. Human readable formats
tempt people to manually compose documents that are almost syntactically correct,
e.g. HTML. This is too sloppy for an interchange format. Consider how much
better chance you have of getting a working program first time if someone sends
you Java byte code rather than Java source that may not even compile.
-
It needs to be computer-friendly so that a program can rapidly find the data it
wants without having to parse for delimiters of various flavours. If people want
to examine the file detail for debugging, let them use a binary reader/editor.
You could use counted strings rather than delimited strings and use integers to
encode the field types so they can be used directly as table indexes. I would
not go quite so far as to ask for a serialised tree of nodes, but push for a
representation that can rapidly be turned into one.
-
For giant files, the representation should not have substantially more overhead
than the raw binary. There need to be ways of efficiently expressing repeating
patterns. For example, there is no need for delimiters for fixed length data.
There is no need for individual field identifiers for standard groupings of
fields. You want to push as much as possible of the file format description into
the descriptor file, out of the data file. The descriptor file need be
transmitted only once. The data file will typically be transmitted again and
again. There is no need to make the format simple, just compact and fast to
process. All you need is a simple programmer's interface to it. Only a
handful of programmers ever need concern themselves with its inner structure.
-
XML currently only allows for hierarchical trees of data. There are one or two
other types of data out there in the world (e.g. tables, relations, references,
graphs). A universal interchange format should be a little more flexible. If it
is worth doing, it is worth doing right. Obviously the format can't be expected
to handle every conceivable data structure and obsolete every specialised
interchange format ever devised. However, XML is talking big about becoming
universal and should deliver. It can't even handle ordinary business data which
is typically relational not strictly hierarchical.
-
One possible example of the sort of inner structure I am thinking of is my HTML
compactor project.
-
The other thing it needs is some information in the DTD about the allowed data
types. There need to be the usual bounded ints, IEEE floats, IEEE doubles, 8-bit
encoded strings in some reasonably small number of character sets, with maximum
and minimum lengths, as well as a variety of business types, such as zip, zip+4,
state, country, Canusan phone, international phone, date, time, credit card
number, latitude, longitude, etc. When someone is handing you data you need to
know how clean it is. You need to know ahead of time the minimum and maximum
enforced limits on various field sizes.
-
Ideally the new binary format, or a variant of it would also handle the function
HTML does now. This would, in a stroke, give four benefits:
-
Much more compact transmissions, which means much faster transmissions and
lighter loaded servers.
-
No more syntax errors. In the process of converting to binary format all syntax
would either have to be manually or automatically corrected. This means the
browser no longer has to deal with both the official standard, and also all the
common variant errors that people type. This means pages would always render
properly. As it is, pages render properly only in the browser used by the author
which forgives his particular errors. The binary protocol effectively blocks
human HTML coding errors from getting out on the net.
-
Faster rendering since the data would arrive already preparsed. The browser
would know for example how big tables are before it had finished reading the
entire file, and so could start rendering the top part of the document
accurately immediately.
-
Consider the total dollars invested in equipment in the world to transmit HTML,
including servers, satellite links, fibre optic links, cable connections... In a
stroke, you would double the capacity of that equipment to deliver HTML, simply
by switching to a binary delivery format.
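To make the counted-string suggestion in the list above concrete, here is a rough sketch of one possible layout. Everything about it is my assumption for illustration: the one-byte type code, the two-byte big-endian length, and the FIELD_NAMES table that would live in the descriptor file rather than in the data.

```python
import struct

# Hypothetical descriptor: integer type codes index straight into this table.
FIELD_NAMES = ["name", "catalog-number", "quantity"]
NAME, CATALOG_NUMBER, QUANTITY = range(3)


def encode(fields):
    """Encode (type_code, text) pairs as 1-byte type, 2-byte length, UTF-8 data."""
    out = bytearray()
    for code, text in fields:
        data = text.encode("utf-8")
        out += struct.pack(">BH", code, len(data))  # explicit big-endian
        out += data
    return bytes(out)


def decode(blob):
    """Hop from record to record by length; no scanning for delimiters."""
    fields, pos = [], 0
    while pos < len(blob):
        code, length = struct.unpack_from(">BH", blob, pos)
        pos += 3
        fields.append((code, blob[pos:pos + length].decode("utf-8")))
        pos += length
    return fields


item = [(NAME, "Jet-Propelled Roller Skates"),
        (CATALOG_NUMBER, "345-678-9"),
        (QUANTITY, "2")]
xml_form = ("<item><name>Jet-Propelled Roller Skates</name>"
            "<catalog-number>345-678-9</catalog-number>"
            "<quantity>2</quantity></item>")

blob = encode(item)
print(len(blob), "bytes binary versus", len(xml_form.encode("utf-8")), "bytes XML")
for code, text in decode(blob):
    print(FIELD_NAMES[code], "=", text)
```

This is nowhere near a full design (it still stores the quantity as text, for one thing), but it shows how tag names can move out of the data and how a reader can skip fields it does not want without parsing them.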
One possible candidate for the XML replacement job is the Java serialised object
format. It can handle just about any data structure imaginable. It is platform
independent. It has a simple DTD -- Java source code for the corresponding class.
Some claim it is Java-only. Not so. It is no more difficult for C++ to parse
than any other similar newly concocted protocol. It is not tied to any hardware
or OS. It is just that Java has a head start implementing it. Java can implement
it with no extra overhead.
There have been some efforts made to patch up the shortcomings of XML, in fact
there are dozens of them. XML is no longer simple. It is a raggedy
patchwork quilt. People were sucked in by the initial simplicity, then
discovered that it was not really all that useful in its simple form. Schema was
added to allow specifying types (but still only permitting strings). Yes, we need
a standard interchange format, but XML was only a back of the envelope stab at
it. XML was destined to fail since it totally ignored so many factors in coming
up with a good design.
One such effort is the Virtual Token Descriptor (VTD). A
VTD record is a 64-bit integer that encodes the starting offset, length, type
and nesting depth of a token in an XML document. Because VTD records don't
contain data fields, they work alongside of the original XML document, which is
maintained intact in memory by the processing model.
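Packing four numbers into one 64-bit integer is plain bit shifting. The sketch below shows the idea only. The bit widths and the TEXT_TOKEN code are my own illustrative assumptions, not the layout defined by the VTD specification.

```python
# Assumed field widths, totalling 64 bits (illustrative, not the official layout).
OFFSET_BITS, LENGTH_BITS, TYPE_BITS, DEPTH_BITS = 32, 20, 4, 8
TEXT_TOKEN = 5  # hypothetical token-type code


def pack_record(offset, length, token_type, depth):
    """Combine offset, length, type and depth into one 64-bit integer."""
    for value, bits in ((offset, OFFSET_BITS), (length, LENGTH_BITS),
                        (token_type, TYPE_BITS), (depth, DEPTH_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError("field does not fit in its bit width")
    record = depth
    record = (record << TYPE_BITS) | token_type
    record = (record << LENGTH_BITS) | length
    record = (record << OFFSET_BITS) | offset
    return record


def unpack_record(record):
    """Split a packed record back into (offset, length, token_type, depth)."""
    offset = record & ((1 << OFFSET_BITS) - 1)
    record >>= OFFSET_BITS
    length = record & ((1 << LENGTH_BITS) - 1)
    record >>= LENGTH_BITS
    token_type = record & ((1 << TYPE_BITS) - 1)
    depth = record >> TYPE_BITS
    return offset, length, token_type, depth


document = b"<order><customer>Wile E. Coyote</customer></order>"
start = document.index(b"Wile")
record = pack_record(start, len(b"Wile E. Coyote"), TEXT_TOKEN, 2)

offset, length, token_type, depth = unpack_record(record)
# The record holds no data itself; it only points into the original bytes.
print(document[offset:offset + length])
```

Note that the original XML text still has to stay in memory, so this saves parsing effort on repeat visits rather than RAM.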
Due to the stupidity, duplicity and/or greed of those promoting XML, we will
likely be stuck with some committee-patched variant of it forever -- something
that will make even HTML look clean. We need a common data interchange format,
but not so inept.
DTD
You need to compose a DTD file that describes the format of the XML file. The <!ELEMENT
statement is used to list the various tags you will use, and which tags may be
used inside which tags, and how often and in which order. The <!ATTLIST
statement is used to list the various attributes (mandatory and optional) of
each tag. The <!ENTITY statement lets you make up
your own abbreviations.
Here is a simple example:
DTD:
<!ELEMENT square EMPTY>
<!ATTLIST square width CDATA "0">
The CDATA means the value of the field is a string.
XML:
<square width="100"></square>
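As a quick check of what the "0" in the ATTLIST does, here is a sketch that puts the same DTD inline in the document and parses it with Python's xml.etree.ElementTree. Bear in mind this parser is non-validating: it reads the inline declarations and fills in attribute defaults, but it will not reject a document that breaks the DTD's rules.

```python
import xml.etree.ElementTree as ET

# The DTD from the example above, included inline as an internal subset.
INLINE_DTD = """<!DOCTYPE square [
<!ELEMENT square EMPTY>
<!ATTLIST square width CDATA "0">
]>
"""

explicit = ET.fromstring(INLINE_DTD + '<square width="100"></square>')
defaulted = ET.fromstring(INLINE_DTD + "<square/>")

# An explicit width is passed through as a string.
print(explicit.get("width"))
# With no width given, the parser supplies the default declared in the ATTLIST.
print(defaulted.get("width"))
```

Either way the value arrives as a string; turning it into a number is still your problem.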