Science matters: UniKanji

Y'all know that all the words you read on The Blob are stored somewhere on The Cloud as bytes made up of several 1 or 0 bits, because computers are too stupid to count above 1 2 1 2 1 2. In the rough and tumble early days of computing there was a bit of a war [ASCII v EBCDIC] about the details of how letters would be mapped to bytes and ultimately to on/off, 1/0. That was all fine when programmers all spoke English or at least used the Latin alphabet. When Russians and Greeks and Tibetans started to digitise their world, an alphabet of single 8-bit bytes wouldn't do it and so Unicode was created to allow 'foreign' letters, accents and punctuation squiggles to be written and read by using up to 4 x 8-bit bytes. You don't want to use 4 bytes for every letter, so Unicode's UTF-8 encoding cleverly has lead bytes which say "the next one, two or three bytes should be read with this one". That's like telephone numbers: 00 indicates the next set of numbers are pointing at Dublin, New Hampshire rather than Dublin, Dublin. UTF-8 allows for 1,112,064 different code points. I've written about how getting your Unicodes in a tangle can be a matter of life and death.
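For anyone who wants to poke at the variable-length trick, here is a minimal Python sketch. The function name sequence_length and the sample characters are my own choices for illustration; the bit patterns are the standard UTF-8 lead-byte prefixes. It reads the length of a sequence off its first byte, and shows where the 1,112,064 comes from: all 0x110000 code points minus the 2,048 surrogates that UTF-8 is not allowed to encode.

```python
def sequence_length(lead_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence has, judged from its first byte."""
    if lead_byte >> 7 == 0b0:       # 0xxxxxxx: plain ASCII, stands alone
        return 1
    if lead_byte >> 5 == 0b110:     # 110xxxxx: one more byte follows
        return 2
    if lead_byte >> 4 == 0b1110:    # 1110xxxx: two more bytes follow
        return 3
    if lead_byte >> 3 == 0b11110:   # 11110xxx: three more bytes follow
        return 4
    raise ValueError("not a lead byte (continuation byte or invalid)")


# Hypothetical sample: Latin, accented Latin, Cyrillic, kanji, emoji.
SAMPLES = ["A", "é", "Ж", "山", "😀"]

for ch in SAMPLES:
    encoded = ch.encode("utf-8")
    hex_bytes = " ".join(f"{b:02x}" for b in encoded)
    print(f"U+{ord(ch):04X}  bytes: {hex_bytes}  length: {sequence_length(encoded[0])}")

# Every code point from U+0000 to U+10FFFF, minus the surrogate range
# U+D800..U+DFFF, which is reserved for UTF-16.
VALID_CODE_POINTS = 0x110000 - (0xDFFF - 0xD800 + 1)
print(VALID_CODE_POINTS)
```

The lead byte does the same job as the 00 on a phone number: it tells the reader how much of what follows belongs together.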

What about kanji? If you think you have it hard mastering English spelinge and apo'strophes, what about being required to learn several thousand characters borrowed from Chinese in order to be able to read the newspaper? Even for the pre-digital age, can you imagine being a type-setter in Japan? Would your arms be long enough to reach the bins with the less-used kanji? Like everyone else, the Japanese have gone digital and have staked out their own coded territory to map their world to The Cloud. It is called JIS-X-0208 aka 7ビット及び8ビットの2バイト情報交換用符号化漢字集合 and uses nearly 7,000 double-bytes to uniquely identify all the characters that are now, or ever have been, used to label words, locations and personal names. You can have fun with Google translate trying to parse 7ビット = nana-bitto = 7-bit . . . 2バイト = ni-baito = 2-byte . . . 情報 = jōhō = information. Each kanji is mapped to a row and column in a 94 x 94 grid (the characters were later given places in Unicode too) and a clever graphic designer is tasked to render it into pixels. Here is the 59th row
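The row-and-column addressing can be tried out in Python. This is a rough sketch, assuming Python's built-in euc_jp codec as a convenient window onto the JIS-X-0208 grid: in EUC-JP, a character at a given row and cell is stored as two bytes, each being 0xA0 plus the row or cell number. The function names kuten_to_char and char_to_kuten are made up for this example.

```python
def kuten_to_char(row: int, cell: int) -> str:
    """Look up the character at (row, cell) of the 94 x 94 grid."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must both be in 1..94")
    # EUC-JP stores a JIS X 0208 position as (0xA0 + row, 0xA0 + cell).
    # Unassigned cells raise UnicodeDecodeError.
    return bytes([0xA0 + row, 0xA0 + cell]).decode("euc_jp")


def char_to_kuten(ch: str) -> tuple[int, int]:
    """Find the (row, cell) of a character, if it lives in the grid."""
    encoded = ch.encode("euc_jp")
    # Two bytes, both above 0xA0, means a plain JIS X 0208 character;
    # anything else (ASCII, half-width kana, JIS X 0212) is not in the grid.
    if len(encoded) != 2 or encoded[0] < 0xA1 or encoded[1] < 0xA1:
        raise ValueError(f"{ch!r} is not a JIS X 0208 character")
    return encoded[0] - 0xA0, encoded[1] - 0xA0


print(kuten_to_char(16, 1))   # first kanji in the grid
print(char_to_kuten("山"))
print(char_to_kuten("女"))
```

Going through a codec rather than a hand-typed table keeps the sketch short, at the cost of trusting whatever mapping table your Python ships with.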

There is some sort of logic to it: the consecutive kanji 杼杪枌枋枦枡枅 all share the same radical 木 on the left. In fact if you look carefully you'll see that more than half the kanji on row 59 share that attribute.
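If you'd rather count than squint, here is a hedged Python sketch that pulls out row 59 and tallies the tree-radical kanji. It leans on two assumptions of mine: that Python's euc_jp codec reflects the JIS-X-0208 grid, and that kanji in the main Unicode CJK block are sorted by radical, so everything from 木 (U+6728) up to just before 欠 (U+6B20) is filed under the tree radical. That test says "filed under 木", not "has 木 on the left", so treat the count as approximate. The names decode_row and has_tree_radical are invented for this example.

```python
def decode_row(row: int) -> list[str]:
    """Return the characters in one row of the grid, skipping empty cells."""
    chars = []
    for cell in range(1, 95):
        try:
            chars.append(bytes([0xA0 + row, 0xA0 + cell]).decode("euc_jp"))
        except UnicodeDecodeError:
            continue  # unassigned cell
    return chars


# Assumed bounds of the tree-radical run in the main CJK block:
# 木 is U+6728 and the next radical, 欠, starts at U+6B20.
TREE_FIRST, TREE_LAST = 0x6728, 0x6B1F


def has_tree_radical(ch: str) -> bool:
    """Heuristic: is this kanji filed under radical 木 in Unicode order?"""
    return TREE_FIRST <= ord(ch) <= TREE_LAST


row59 = decode_row(59)
tree = [ch for ch in row59 if has_tree_radical(ch)]
print("".join(row59))
print(f"{len(tree)} of {len(row59)} kanji in row 59 are filed under 木")
```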

In 1978, when the Japanese government embraced the JIS-X-0208 standard, they required minions to read all kinds of different sources: birth records, gazetteers, maps, books, and scrolls to identify all the kanji which had been used somewhere, sometime and would therefore require their own place in the digital sun. After the minions had diligently worked through the canon of info, copy-editors, users and readers noticed some oddities in the lists: characters which nobody could identify, let alone source or pronounce. They became known as ghost characters. Twenty years later, some officious cleaner-upper in the bureaucracy launched an investigation to find out where and how these zombie characters had arisen and kill them if required. Of course, the original compilers had included a source, but sometimes that source was super-unspecific: "Overview of National Administrative Districts", a formidable document which runs to 6000 closely printed pages.

We've just come back from a road-trip in England where place names include Newcastle-upon-Tyne, not to be confused with Newcastle-under-Lyme, Stoke-on-Trent, Grange-over-Sands, Bradwell-juxta-Coggeshall. In Japan they have the same sort of thing. One place is called "山 over 女" which they captured by pasting the two characters one above the other to give ghost-char 妛: the extra middle stroke resulting from the fuzz where the two characters were glued together. That misunderstanding was captured, propagated and made concrete by continued reproduction despite its meaning being lost. I find that quite delightful: like how a science-anxious lexicographer saw "D or d; n. density" and crammed them together to coin Dord n. density.


Thursday, 23 August 2018
