Monday 16 February 2015

ASCII v EBCDIC

Sometimes, more or less at the same time, two different ways are invented for achieving the same thing.  They vie for market share for a while and then, often, there is a tipping point and one solution Harry Potters the other into oblivion.  Not because one is fitter-for-purpose or better engineered than the other; sometimes it's just the flap of a butterfly wing that causes the fatal reduction in diversity.  Remember Betamax video cassettes anyone?  You'll be ten years older than people who remember VHS, which won that battle.  Another example is the bizarre 4ft 8.5in gauge on British railways that was intrinsically worse than Brunel's 7ft GWR gauge but won the gauge wars by being the firstest with the mostest in building track.

A similar war was fought after I was born but before I learned to program. It all came back to me when I read Tim Hunkin's page on word-processing, part of which I've clipped [R] - this scheme for encoding letters is called ASCII (American Standard Code for Information Interchange). I don't know if Hunkin is Jewish but he's written the numbers/encoding Right-Left like Hebrew.  But you can see how the lower case letters are sequential  (in what follows, I've flipped the numbers so the smallest bit is on the right):
a 1100001; b 1100010; c 1100011 etc. a=97; b=98; c=99
The last two dots at the right of each row in the picture represent 2^5 (32) + 2^6 (64) = 96. Accordingly lower case 'a' 1100001 = 97 in decimal. ASCII is logical and has the capital letters encoded consistently, exactly 32 places lower than their lower case equivalents:
A 1000001; B 1000010; C 1000011 etc. A=65; B=66; C=67
You might expect that the numerals start the whole series off, because numbers is what computers 'do', but they don't: '1' sits at 49:
1 0110001; 2 0110010; 3 0110011 etc. 1=49; 2=50; 3=51
and, as you ask, zero: 0 0110000 = 2^4 (16) + 2^5 (32) = 48.  The lower codes are for a number of essential computer control features like
CR  013 0001101 (carriage return)
LF  010 0001010 (line feed)
BS  008 0001000 (back-space)
BEL 007 0000111 (bell that made the >!ding!< sound)
Here the computer was being instructed to behave in ways that were mechanically achieved by its predecessor, the old-fashioned manual typewriter, including the >!ding!<.
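You can check all these numbers for yourself in a couple of lines of any modern language.  Here's a little Python sketch of my own (nothing to do with Hunkin's page): ord() gives the decimal code for a character and format() shows the same number as seven binary bits.

# a quick look at the ASCII codes mentioned above
for ch in ['a', 'b', 'A', '1', '0', '\r', '\n', '\b', '\a']:
    print(repr(ch), ord(ch), format(ord(ch), '07b'))

# prints, among others:
# 'a'   97  1100001
# 'A'   65  1000001
# '1'   49  0110001
# '\r'  13  0001101   (carriage return)
# '\a'   7  0000111   (the >!ding!< bell)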

This is all concerned with how letters and numbers, which humans can easily read, are represented by binary ON/OFF or 1s and 0s, which was all the subtlety that computers were able for. The internal consistency means that it is easy to convert lower-case to upper [subtract 32] and vice-versa [add 32] or determine if a word begins with a capital letter [if 65 ≤ First ≤ 90] and so treat it like a proper name.
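In code that is a one-liner each way.  Here is my own little Python illustration of the ±32 trick and the capital-letter test - just a sketch, not anybody's official library:

def to_upper(ch):
    # lower-case ASCII letters sit exactly 32 above their capitals,
    # so changing case is simple arithmetic on the code
    if 97 <= ord(ch) <= 122:          # 'a'..'z'
        return chr(ord(ch) - 32)
    return ch

def starts_with_capital(word):
    # a capital is any code from 65 ('A') to 90 ('Z') inclusive
    return 65 <= ord(word[0]) <= 90

print(to_upper('q'))                  # Q
print(starts_with_capital('Dublin'))  # True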

In the post-WWII Computer Age, IBM was first out of the gate to implement a binary-to-letter/number/punctuation code for their early electronic computers. They realised that fitting everything (TWO alphabets, a clatter of punctuation, the ten digits and all that carriage-return infrastructure) into 7 bits (0-127) would be tight and so used an 8-bit code called EBCDIC (Extended Binary Coded Decimal Interchange Code) which grew out of a minimalist 6-bit BCDIC code used on the very earliest computers that were programmed on stacks of punched cards or paper tape with holes in it.  A 6-bit code only allows for 64 (0-63) different symbols/functions and so never allowed lower case letters.  This is why early programming languages COBOL and FORTRAN are written entirely in UPPER-CASE and so read a bit "SHOUTY" to modern eyes. With an extra two bits to play with (0-255) EBCDIC spaced everything out and weirdly shoved the alphabet into three separated blocks:
abcdefghi 129-137 10000001-10001001
jklmnopqr 145-153 10010001-10011001
~stuvwxyz 161-169 10100001-10101001
Written like that you can see where they were coming from: each alphablock starts 2^4 (16) higher than the previous one.  But this is a total pain in the tits for coding: you can't ask one question to determine if a character is part of a word; you must ask three separate questions.  But EBCDIC was the way that the phenomenally successful IBM/360 series of business computers operated, when launched in 1964, and so they nearly swept the board with their clunky choice of encodings.  When I was programming an IBM/370 in 1980, I had to get my head around EBCDIC encoding but I also had to get to grips with ASCII, which had started slimmer (7 bits for each character) and was out-running the EBCDIC monster. ASCII's inadequacies - too few slots for açcéñted léttêrs or Greek and Cyrillic characters - were addressed when it was expanded into Unicode, and even IBM have long ago signed up to that international standard. 85% of what you read on the WWW is written in UTF-8, which combines the efficiency of ASCII with the universality of Unicode by encoding each letter with 1, 2, 3 or 4 8-bit bytes.  If the text is straightforward American it all compacts into one byte per letter; want accented letters, you foreign-johnnies? - add an extra byte; want to write in Hindi - add another byte.  UTF-8's 1,112,064 different letters will suffice until we start talking to other planets.
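For the curious, here is a Python sketch of my own showing why the scattered EBCDIC blocks cost three questions where ASCII costs one, with a quick look at the UTF-8 byte counts tacked on the end.  The ranges are the ones quoted above; this is an illustration, not anything IBM ever shipped.

def is_ebcdic_lower(code):
    # EBCDIC scatters a-z across three blocks, so three range checks:
    # a-i = 129-137, j-r = 145-153, s-z = 162-169 (the ~ sits at 161)
    return (129 <= code <= 137) or (145 <= code <= 153) or (162 <= code <= 169)

def is_ascii_lower(code):
    # ASCII keeps a-z contiguous, so one question does the job
    return 97 <= code <= 122

# UTF-8: plain American text costs one byte a letter, accents two, Hindi three
for letter in ['a', 'é', 'ह']:
    print(letter, len(letter.encode('utf-8')), 'byte(s)')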
