After reading this article, you will know why the letters of many of the world's languages take two bytes of memory while English letters take only one, and much more. I hope you will find this article interesting.
- Disadvantages and advantages of single-byte encodings
- Unicode and its benefits
- How does a computer interpret Unicode characters?
- Unicode disadvantages and Quick Search Algorithm
- Programming language and encodings
- Using Python encoders and the Encodings module
- Iterative and incremental codecs
- Base64, Big5 and Monkeys on YouTube
As mentioned in previous articles, any distinguishable state can be used to encode information. Electronic devices have only two such states: "on" and "off". The English alphabet has 26 letters, so 5 bits are enough to represent each letter, because 5 bits give 2 ^ 5 = 32 distinguishable states. If we consider both capital and small letters, 6 bits are required for the 52 characters.
However, in addition to the basic printable characters, there are also control characters. Each time you press the Enter key while typing text, you insert an invisible end-of-line control character. On Windows operating systems, the end of a line is indicated by two characters. Microsoft guys like to take up space on your hard drive. Of course, this is a joke: old printers simply required a carriage return character.
Seven light bulbs are enough to encode an English letter and some additional characters that do not have a graphic representation. Instead of light bulbs, anything can be used to encode information: any physical process with n states will do. A computer is built primarily of semiconductor elements, and therefore it uses only two states. One bit - 2 ^ 1 states, one byte (8 bits) - 2 ^ 8 = 256 states, two bytes - 2 ^ 16 = 65536 states.
However, even if you use the ASCII encoding to represent characters, storing one character requires 8 bits. This is because memory and CPU registers are organized in bytes, which makes the information more convenient to process. If you like animated illustrations and you want to print a poster for a computer class, then you can download it from the openclipart website (poster1, poster2).
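The arithmetic above can be checked with a short sketch. The helper name `bits_needed` is made up for this illustration and is not part of any library.

```python
def bits_needed(states: int) -> int:
    """Smallest number of bits that can distinguish `states` values."""
    # For states >= 1, (states - 1).bit_length() equals ceil(log2(states)).
    return (states - 1).bit_length()


print(bits_needed(26))          # English letters of one case
print(bits_needed(52))          # capital and small letters together
print(2 ** 5, 2 ** 8, 2 ** 16)  # states available in 5, 8 and 16 bits
```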
ASCII itself uses only 7 bits. The eighth bit is used by extended encodings to represent characters from other alphabets and pseudographic elements such as dashes, squares, circles, and so on. In Windows operating systems, such encodings are named as follows: Windows-XXXX. Instead of XXXX, the number assigned to one of the national alphabets, or to its variation, is used. The Windows-1251 encoding represents the characters of the Russian alphabet. On Linux operating systems, this encoding is called CP-1251. CP is the English abbreviation for "code page". Characters in these encodings are 1 byte in size.
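As a minimal sketch, assuming Python 3 (where the encoding is available under the codec name `cp1251`), the one-byte size of Russian letters in this code page can be observed directly:

```python
text = "Привет"  # six Russian letters

cp1251_bytes = text.encode("cp1251")
print(len(text), len(cp1251_bytes))  # number of characters vs. number of bytes
print(cp1251_bytes.hex())

# Every byte has the eighth (most significant) bit set, i.e. a value >= 128.
print(all(byte >= 0x80 for byte in cp1251_bytes))
```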
Disadvantages and advantages of single-byte encodings
An encoding is a state interpretation table. Since the first 128 values of a byte are occupied by ASCII, only the values with the eighth bit set are left for characters from other alphabets. Therefore, it is possible to represent 2 ^ 7 = 128 additional characters, letters and signs. Imagine that you live in Cambodia and have 74 letters in your alphabet. In addition to them, there are also other symbols and signs that need to be represented. It is clear that one byte is not enough.
Some countries, such as the Czech Republic, Slovakia, Hungary, Poland and Romania, use the Windows-1250 encoding for single-byte characters. If a Russian-speaking user opens a Romanian website that uses this encoding, then he will see gibberish on the screen that does not contain Romanian characters. Most likely, the browser will interpret the text as Windows-1251.
You can create a lot of encodings, but it is very important that those who read the written text can interpret it. To support single-byte encodings, modern browsers let you select the encoding manually. But this is very inconvenient and time consuming. Modern people are not used to spending time on such operations.
And yet, single-byte encodings are still used. For example, the IBM DB2 database management system allows you to define single-byte encodings for text fields (Windows-1251, Windows-1250, and others). Other DBMSs also allow you to do this. The main advantage of single-byte encodings is the size of a single character. This allows you to use storage space efficiently, especially when there is a lot of text data. However, this does not apply to public administration systems. It is no secret that all countries exchange information among themselves. There are special commissions at the UN which are responsible for collecting statistical information, for example the UN trade commission (UN Comtrade).
In addition to all the above, there are Open Data projects for implementing transparent government management. The point of these projects is that every citizen can see how decisions are made by each individual official and, if he is not satisfied with the management decisions, remove that official from management. According to the requirements of ISO and ICAO, computer networks and systems that solve tasks of this kind should store information in Unicode.
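The effect can be reproduced with a small sketch, assuming Python 3 and its `cp1250` and `cp1251` codecs. The sample phrase is chosen only for illustration.

```python
original = "Bună seara"                  # Romanian for "Good evening"

raw = original.encode("cp1250")          # bytes as the website sends them
garbled = raw.decode("cp1251")           # the same bytes read with the wrong code page

print(garbled)                           # a Cyrillic letter appears instead of "ă"
print(raw.decode("cp1250") == original)  # the correct code page restores the text
```

The bytes themselves never change; only the table used to interpret them does.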
Unicode and its benefits
To avoid switching between code pages, the Unicode standard was invented. The Unicode Consortium is responsible for its development. The most popular method of representing Unicode characters today is UTF-8. For efficient data storage, it uses characters of different sizes: 1 byte, 2 bytes, 3 bytes and 4 bytes. This standard is compatible with ASCII: the first 128 characters are the English letters, digits, punctuation marks and control characters that we considered at the very beginning. Each letter of the Russian and Belarusian alphabets, as well as each accented letter of Polish and many other languages, has a size of two bytes. This means that for 4096 characters of the English alphabet an Englishman will spend 4 KB, while a Russian or Belarusian writer will spend 8 KB of memory on the same number of letters. Do not be offended, just continue to live on.
In general, two bytes of memory are enough to represent 65536 different states, but UTF-8 reserves some of the bits to mark the length of the sequence, so its two-byte sequences cover only the code points from U+0080 to U+07FF. This range includes the alphabets of many world languages (Latin letters with diacritics, Greek, Cyrillic, Armenian, Hebrew, Arabic) and specific signs such as the Armenian sign of eternity (arevakhach). Less frequently used characters are 3 bytes in size. This set includes arrows, mathematical symbols, the "№" sign, most Chinese characters, the Slavonic asterisk, the Cyrillic multi-eyed letter "o" (ꙮ), other symbols of the old Slavonic script, and runes. Emoji and other supplementary characters are 4 bytes in size. And this is very sad. Storing a text consisting of 4096 emoji will require 16 KB, which is 4 times more than the same number of English letters!
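The size of any particular character can be measured rather than memorized. A minimal sketch, assuming Python 3; the sample characters are written as escapes so that the code does not depend on the font or editor:

```python
samples = [
    "A",            # LATIN CAPITAL LETTER A
    "\u042f",       # CYRILLIC CAPITAL LETTER YA
    "\u058d",       # RIGHT-FACING ARMENIAN ETERNITY SIGN
    "\u2116",       # NUMERO SIGN
    "\ua66e",       # CYRILLIC LETTER MULTIOCULAR O
    "\u16a0",       # RUNIC LETTER FEHU
    "\U0001f600",   # GRINNING FACE emoji
]

sizes = [len(ch.encode("utf-8")) for ch in samples]
for ch, size in zip(samples, sizes):
    print(f"U+{ord(ch):04X}: {size} byte(s), {ch.encode('utf-8').hex()}")
```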
How does a computer interpret Unicode characters?
How do the browser and other applications know how many bytes a character takes? Applications process the text byte by byte. If the first byte has a numeric value of 192 or more (that is, if the two most significant bits of the first byte are 1), then it starts a multi-byte character, and the two most significant bits of every following byte of that character are 10. If exactly two leading bits are set, the character consists of two bytes; if three leading bits are set, the character consists of three bytes. A table describing the principle of interpreting UTF-8 is presented below.
|Bytes|Bit pattern|
|1|0XXXXXXX|
|2|110XXXXX 10XXXXXX|
|3|1110XXXX 10XXXXXX 10XXXXXX|
|4|11110XXX 10XXXXXX 10XXXXXX 10XXXXXX|
Instead of the letter "X" you can put 1 or 0. At the very beginning, the UTF-8 specification (RFC 2279) allowed characters of up to 6 bytes in size (6 binary octets).
In 2003, the situation changed, and only 4 octets are now allowed. This is quite enough for all world languages and even for emoji pictograms and symbols. Perhaps, in the distant future, the number of emoji will increase and each person will have his own emoji characters; then 6 octets will be very useful.
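The rule for the first byte can be sketched in a few lines, assuming Python 3. The function name `utf8_sequence_length` is hypothetical, and the sketch checks only the leading bit pattern; a real decoder must also reject overlong and otherwise forbidden sequences.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Number of bytes in a UTF-8 sequence, judged by its first byte only."""
    if first_byte < 0x80:            # 0XXXXXXX
        return 1
    if 0xC0 <= first_byte < 0xE0:    # 110XXXXX
        return 2
    if 0xE0 <= first_byte < 0xF0:    # 1110XXXX
        return 3
    if 0xF0 <= first_byte < 0xF8:    # 11110XXX
        return 4
    # 10XXXXXX is a continuation byte; 11111XXX is not allowed by RFC 3629.
    raise ValueError(f"not a valid first byte: {first_byte:#04x}")


for ch in ["A", "\u042f", "\u2116", "\U0001f600"]:
    data = ch.encode("utf-8")
    print(f"{data[0]:08b}", utf8_sequence_length(data[0]), len(data))
```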
Unicode disadvantages and Quick Search Algorithm
The more bytes of information there are, the slower and more difficult the search is. The authors of the UTF-8 RFC considered these difficulties: as RFC 3629 notes, the Boyer-Moore fast search algorithm can be used with UTF-8 data. The Boyer-Moore algorithm (with its bad character heuristic) allows you to quickly find text represented as Unicode characters. Your browser applies an algorithm of this kind when you press Ctrl + F. The research interests of its authors, Robert S. Boyer and J Strother Moore, included the study and construction of finite digital automata.
Let the length of the text being searched be N and the length of the search pattern be M. The phase of initial calculations will take O (M^2 + σ) operations, where σ is the size of the alphabet used. Then, in the best case, the execution of the algorithm will take O (N / M) comparisons. In the worst case, when the text consists of identical characters (for example, a search for 5 letters "C" in a text of 100 letters "C"), the algorithm will take O (N * M) comparisons.
In terms of security, it is very important to avoid buffer overflow errors. In 2001, forbidden octet sequences combined with incorrect parsers led to serious security problems. Therefore, when writing your own Unicode character handler, pay attention to the security considerations of the RFC standard [RFC 3629, clause 10].
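The following is a minimal sketch of the search using only the bad character heuristic, assuming Python 3. It works on the UTF-8 bytes of the text, so the alphabet size σ is 256. The function name `bad_char_search` is made up for this example, and the full Boyer-Moore algorithm adds a second (good suffix) rule that is omitted here.

```python
def bad_char_search(text: bytes, pattern: bytes) -> int:
    """Return the byte index of the first occurrence of pattern, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    # Preprocessing: the last position of every byte value in the pattern.
    last = {byte: i for i, byte in enumerate(pattern)}
    shift = 0  # current alignment of the pattern against the text
    while shift <= n - m:
        j = m - 1
        # Compare the pattern with the text from right to left.
        while j >= 0 and pattern[j] == text[shift + j]:
            j -= 1
        if j < 0:
            return shift
        # Line up the mismatched text byte with its last occurrence in the
        # pattern; always move forward by at least one position.
        shift += max(1, j - last.get(text[shift + j], -1))
    return -1


haystack = "Привет, мир".encode("utf-8")
needle = "мир".encode("utf-8")
print(bad_char_search(haystack, needle))        # index in bytes, not in characters
print(bad_char_search(b"C" * 100, b"C" * 5))    # the worst-case input from the text
```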
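A quick way to see what "forbidden octet sequences" means is to feed a few of them to a strict decoder. This is a sketch assuming Python 3, whose built-in UTF-8 decoder is strict by default; the helper `is_valid_utf8` is hypothetical.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode as UTF-8 under strict error handling."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8("/".encode("utf-8")))  # ordinary one-byte slash
print(is_valid_utf8(b"\xc0\xaf"))          # overlong two-byte form of "/"
print(is_valid_utf8(b"\xed\xa0\x80"))      # encoded surrogate U+D800
print(is_valid_utf8(b"\xd0"))              # truncated two-byte sequence
```

A parser that accepted the overlong form would see a slash that earlier byte-level checks did not see, which is the kind of problem the RFC warns about.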
Programming language and encodings
Some people remember how problematic the interpretation of Unicode characters was in old programming languages. Moreover, there were a huge number of equally old programs that did not support this encoding. If instead of letters you see small squares, then this may mean the following:
- There is no Unicode support;
- The selected font has no graphic representation for this character.
Today there are practically no cases when Unicode support is missing. Practically all modern programming languages and applications support this encoding by default. But not all fonts have a graphic representation for emoji pictograms and symbols. This should be considered when developing programs that work with text data.
Using Python encoders and the Encodings module
Python is a cross-platform language ported to Windows, macOS and Android. Most Linux distributions have it installed by default. On Windows, run Python IDLE to use the interpreter. You can also install an integrated development environment, such as Visual Studio Community or PyCharm, to simplify development in this programming language. Python is among the easiest languages to use, and it is popular with scientists, network engineers and enthusiasts. It is simple and can be used step by step in interactive mode. To view the help for the encodings module directly from interactive mode, enter the following two lines in turn.
- import encodings
- help(encodings)
If you decide to create your own codec, also read the help for the codecs module: import the codecs module and call help() in the same way. You can create your own codec and register a search function for it. To view a list of all names (functions and submodules) available in a module, enter the dir(encodings) command. Be careful: if you have not entered the "import module_name" statement first, dir() will raise an error:
NameError: name 'encodings' is not defined
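The same inspection can be done from a script. A minimal sketch, assuming Python 3; `codecs.lookup()` is used here as one convenient way to check that an encoding name is known:

```python
import codecs
import encodings

# dir() works only after the import above; show a few public names.
public_names = [name for name in dir(encodings) if not name.startswith("_")]
print(public_names[:5])

# codecs.lookup() accepts aliases and returns a CodecInfo record.
info = codecs.lookup("windows-1251")
print(info.name)
```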
In many languages there is a function ord(), which allows you to find out the code of a character in the decimal number system. Using codecs, you can get the codes of several characters at once, not just of one character. For the hexadecimal representation, apply the codec "hex_codec", as in the figure below.
This encoder received, as an argument, a string that consists of the first 5 letters of the English alphabet and the first 5 letters of the Russian alphabet. You can find out the length of the resulting string using the len() function, but the result needs to be divided by 2, since one binary octet is represented by two hexadecimal digits (from 0 to 'f'). As you can see, the string is 15 bytes in size: 5 bytes of English letters and 10 bytes of Russian letters. Please note that the first byte of each Russian character is "d0", and that the English letter "A" differs in numerical value from the Russian letter "А". The int() function converted the hexadecimal value of the English letter "A" to decimal 65. To convert to the binary system, use the bin() function.
Important note! When you copy text from an application that saves files in a Windows-XXXX (CP-XXXX) format and paste it into the Python interpreter, the text is automatically converted to UTF-8. If you want to find the corresponding character codes in other encodings, take the hexdump utility (any other hex editor will do) and open a file saved in another encoding, for example in CP-866, as in the figure above.
The punycode encoder allows you to represent Unicode symbols via ASCII characters. You can revert the encoded string to its original state using the decode() method. To exit interactive mode, call the quit() function.
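The steps described above can be sketched as follows, assuming Python 3. There the hex codec works on bytes, so the string is encoded to UTF-8 first; that detail is an assumption of this sketch and may differ from the session in the figure.

```python
import codecs

text = "ABCDE" + "АБВГД"          # 5 English and 5 Russian letters
utf8_bytes = text.encode("utf-8")

hex_string = codecs.encode(utf8_bytes, "hex_codec").decode("ascii")
print(hex_string)
print(len(hex_string) // 2)        # two hexadecimal digits per byte

print(ord("A"), ord("А"))          # Latin "A" and Cyrillic "А" have different codes
print(int("41", 16))               # hexadecimal code of Latin "A" as a decimal number
print(bin(65))                     # the same code in the binary system
```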
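A minimal round trip through the punycode codec, assuming Python 3 (the sample word is chosen only for illustration):

```python
word = "привет"

encoded = word.encode("punycode")     # bytes that contain only ASCII characters
print(encoded)

decoded = encoded.decode("punycode")  # back to the original string
print(decoded == word)
```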
Iterative and incremental codecs
There are two types of encoders: iterative encoders and incremental encoders. If an encoder processes characters whose length is known in advance, it is called iterative. This type includes the ASCII encoder and other encoders based on it, for example ROT-13. An example of the use of the ROT-13 encoder is shown in the figure below.
As you can see, it works only with ASCII characters, and if a Russian letter is passed as an argument, it gives an error. (This behaviour depends on the Python version: the rot_13 codec in Python 3 leaves such letters unchanged instead.) To get rid of the error, remove all characters whose numeric code exceeds 127. Thus, only ASCII characters will remain.
As mentioned earlier, Unicode characters have different lengths. Therefore, they cannot be processed by iterative encoders. To process them properly, incremental encoders are used, which contain a counter variable. The purpose of the counter is to count the features (significant bits) that determine the number of bytes in each character. The significant bits of the first byte of the Unicode sequence are used as such features (see the table above).
From the point of view of computer system security, incremental encoders that do not check the number of characters are as dangerous as arrays accessed without bounds checking. Be careful and adhere to the principles of safe programming. Program development is a big responsibility that cannot be neglected. It is not enough just to be a coder; you need to have knowledge and experience in the field of computer security.
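A sketch of the same experiment, assuming Python 3, where ROT-13 is reached through `codecs.encode()`. The ASCII encoder is used to show the error for a Russian letter, and the helper `ascii_only` is a made-up name for the filtering step.

```python
import codecs

print(codecs.encode("Hello", "rot_13"))


def ascii_only(text: str) -> str:
    """Drop every character whose numeric code exceeds 127."""
    return "".join(ch for ch in text if ord(ch) <= 127)


mixed = "Hello, мир"
try:
    mixed.encode("ascii")
except UnicodeEncodeError as error:
    print("ASCII encoder failed:", error.reason)

cleaned = ascii_only(mixed)
print(codecs.encode(cleaned, "rot_13"))
```

Applying ROT-13 twice returns the original text, because the English alphabet has 26 letters.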
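Python exposes this idea through `codecs.getincrementaldecoder()`. In the sketch below (Python 3 assumed), the bytes arrive one at a time, as they might from a network socket, and the decoder keeps the unfinished sequence in its internal state until the character is complete.

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()
data = "A\u042f\U0001f600".encode("utf-8")   # 1 + 2 + 4 bytes

pieces = []
for i in range(len(data)):
    # Feed a single byte; the result is "" until a whole character has arrived.
    pieces.append(decoder.decode(data[i:i + 1]))
# final=True makes the decoder raise an error if a sequence was left unfinished.
pieces.append(decoder.decode(b"", final=True))

print(pieces)
```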
Base64, Big5 and Monkeys on YouTube
Do you know that any program can be transmitted as text? Some sites prohibit uploading executable files in *.EXE format, or binary files in general. Some email clients may skip emails with attached executables. To avoid this, you can convert the contents of the executable file into text whose alphabet consists of 64 ASCII characters. This encoding is called Base64. A file encoded in this way will be about one third larger than the original executable file.
There are some funny moments associated with using Base64. If you use YouTube, you may have seen the error pictured above. As you can see, the text is very similar to Base64 encoding, but the slash and plus symbols are missing. Base64 exists in several variants, and the picture above shows the "urlsafe Base64" variant, in which symbols 62 and 63 ("+" and "/") are replaced with "-" and "_". After decoding it with the Base64 codec, you get the encrypted contents of your browser's stack. The funny monkey in the picture above provokes you to share the screenshot with friends and add the hashtag #youtubemonkeys. Do not rush to laugh, it is not self-irony. YouTube engineers regularly search for such screenshots, decode their contents from Base64, decrypt the secret contents of your browser stack and find the fault. If they asked you directly to share the contents of your browser stack, you would probably refuse, wouldn't you?
The standard set of Chinese characters is sometimes not enough. In Hong Kong, additional characters are commonly used. The Big5 encoding, together with its Hong Kong supplement (HKSCS), is used to represent them. In addition to this encoding, Python supports other exotic encodings that are included in the ISO standards.
We considered the standard ways of representing characters. Even though you see a single character in a text editor, it can have a different size in bytes. There are many encodings, but Unicode dominates among them. The vast majority of applications and websites use this encoding.
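The size growth is easy to observe. In this sketch (Python 3, standard `base64` module) a made-up byte sequence stands in for the contents of an executable file:

```python
import base64

binary = bytes(range(256)) * 3       # stand-in for binary file contents
encoded = base64.b64encode(binary)   # ASCII text safe to send as a message body

print(len(binary), len(encoded))     # every 3 bytes become 4 characters
print(encoded[:16])

restored = base64.b64decode(encoded)
print(restored == binary)
```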
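The difference between the two variants shows up only when the data produces symbols 62 and 63. A minimal sketch, assuming Python 3; the two input bytes are chosen so that both symbols appear:

```python
import base64

raw = b"\xfb\xff"

standard = base64.b64encode(raw)
url_safe = base64.urlsafe_b64encode(raw)
print(standard, url_safe)

# Each variant decodes with its matching function.
print(base64.urlsafe_b64decode(url_safe) == raw)
```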
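As a small illustration, assuming Python 3 and its `big5` and `big5hkscs` codecs, the same Chinese character can be encoded in several ways:

```python
char = "\u4e2d"   # the Chinese character for "middle"

encoded = {name: char.encode(name) for name in ("big5", "big5hkscs", "utf-8")}
for name, data in encoded.items():
    print(name, len(data), data.hex())
```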
Some letters have the same shape but different numerical codes. This feature can be used for steganographic purposes. We considered how to see the code of each symbol: 1) by using the Python programming language; 2) by using the hexdump utility.
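A minimal sketch of telling such look-alike letters apart, assuming Python 3 and the standard `unicodedata` module:

```python
import unicodedata

latin, cyrillic = "A", "\u0410"   # the two letters look the same on screen

for ch in (latin, cyrillic):
    print(ord(ch), unicodedata.name(ch), ch.encode("utf-8").hex())

print(latin == cyrillic)
```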
Sources of information:
 RFC 2279, "UTF-8, a transformation format of ISO 10646" (obsolete version).
 RFC 3629, "UTF-8, a transformation format of ISO 10646".
 ISO 10646, the Unicode standard.