## Monday 16 February 2015

### ASCII v EBCDIC

Sometimes, more or less at the same time, two different ways are invented for achieving the same thing.  They vie for market share for a while and then, often, there is a tipping point and one solution Harry Potters the other into oblivion.  Not because one is fitter-for-purpose or better engineered than the other, sometimes it's just the flap of a butterfly wing that causes the fatal reduction in diversity.  Remember Betamax video cassettes anyone?  You'll be ten years older than people who remember VHS, which won that battle.  Another example is the bizarre 4ft 8.5in gauge on British railways that was intrinsically worse than Brunel's 7ft GWR gauge but won the gauge wars by being the firstest with the mostest in building track.

A similar war was fought after I was born but before I learned to program. It all came back to me when I read Tim Hunkin's page on word-processing, part of which I've clipped [R] - this scheme for encoding letters is called ASCII (American Standard Code for Information Interchange). I don't know if Hunkin is Jewish but he's written the numbers/encoding Right-Left like Hebrew. But you can see how the lower case letters are sequential (in what follows, I've flipped the numbers so the smallest bit is on the right):
a 1100001; b 1100010; c 1100011 etc. a=97; b=98; c=99
The last two dots at the right of each row in the picture represent 2^5 (32) + 2^6 (64) = 96. Accordingly lower case 'a' 1100001 = 97 in decimal. ASCII is logical and consistently encodes the capital letters exactly 32 places lower than their lower case equivalents:
a 1000001; b 1000010; c 1000011 etc. A=65; B=66; C=67
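These bit patterns are easy to verify in any language that exposes character codes. A quick sketch in Python (my illustration, not part of the original page):

```python
# Print each letter with its 7-bit ASCII pattern and decimal code.
for ch in "abcABC":
    print(ch, format(ord(ch), "07b"), ord(ch))
# Note that 'a' (1100001) and 'A' (1000001) differ in exactly one
# bit: the 32s place. That single bit is the whole case distinction.
```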
You might expect the numerals to start the whole series off, because numbers are what computers 'do', but they don't: the digit 1 sits at 49:
1 0110001; 2 0110010; 3 0110011 etc. 1=49; 2=50; 3=51
and, since you ask, zero: 0 0110000, i.e. 2^4 (16) + 2^5 (32) = 48. The lower codes are used for a number of essential computer control functions like
CR  013 0001101 (carriage return)
LF  010 0001010 (line feed)
BS  008 0001000 (back-space)
BEL 007 0000111 (bell that made the >!ding!< sound)
Here the computer was being instructed to behave in ways that were achieved mechanically by its predecessor, the old-fashioned manual typewriter, including the >!ding!<.
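Those codes are easy to confirm from any modern language; this Python sketch (again my illustration, not from the original page) shows both the digit arithmetic and the control characters via their escape sequences:

```python
# ASCII digits occupy 48-57, so a digit character's numeric value
# is just its code minus 48, the code for '0'.
def digit_value(ch):
    return ord(ch) - ord("0")  # ord('0') == 48

print(digit_value("7"))  # 7

# The control codes listed above, reached via escape sequences:
print(ord("\r"), ord("\n"), ord("\b"), ord("\a"))  # 13 10 8 7
```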

This is all concerned with how letters and numbers, which humans can easily read, are represented by binary ON/OFF or 1s and 0s, which was all the subtlety that computers could manage. The internal consistency means that it is easy to convert lower case to upper [subtract 32] and vice versa [add 32], or to determine whether a word begins with a capital letter [if 65 ≤ First ≤ 90] and so treat it like a proper name.
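Those case tricks take only a few lines of Python (the function names here are my own, purely for illustration):

```python
def to_upper(ch):
    # Lower-case letters (97-122) sit exactly 32 above their capitals.
    if 97 <= ord(ch) <= 122:
        return chr(ord(ch) - 32)
    return ch  # leave digits, punctuation etc. alone

def starts_with_capital(word):
    # Capitals occupy codes 65 ('A') to 90 ('Z') inclusive.
    return 65 <= ord(word[0]) <= 90

print(to_upper("a"), starts_with_capital("Bob"))  # A True
```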

In the post-WWII Computer Age, IBM was first out of the gate to implement a binary to letter/number/punctuation code for their early electronic computers. They realised that fitting everything (TWO alphabets, a clatter of punctuation, the ten digits and all that carriage-return infrastructure) into 7 bits (0-127) would be tight, and so used an 8-bit code called EBCDIC (Extended Binary Coded Decimal Interchange Code), which grew out of a minimalist 6-bit BCDIC code used on the very earliest computers, which were programmed on stacks of punched cards or paper tape with holes in it. 6 bits only allow for 0-63 different symbols/functions and so never allowed lower-case letters. This is why the early programming languages COBOL and FORTRAN are written entirely in UPPER-CASE and read a bit "SHOUTY" to modern eyes. With an extra two bits to play with (0-255), EBCDIC spaced everything out and weirdly shoved the alphabet into three separated blocks:
abcdefghi 129-137 10000001-10001001
jklmnopqr 145-153 10010001-10011001
~stuvwxyz 161-169 10100001-10101001
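Python happens to ship an EBCDIC codec (code page 037, the US/Canada variant), so the three separated blocks can be seen directly; a small sketch, assuming the cp037 codec is available:

```python
# Encode the alphabet as EBCDIC (code page 037) and print each code.
letters = "abcdefghijklmnopqrstuvwxyz"
for ch, code in zip(letters, letters.encode("cp037")):
    print(ch, code)
# a-i land on 129-137, j-r on 145-153 and s-z on 162-169,
# with gaps between the blocks -- unlike ASCII's single run 97-122.
```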