In the web page on Basic Cryptography, we very quickly gloss over the fact that, in ASCII, the character A is represented by the number 65, or in binary, 01000001. This fact makes sense to people who are familiar with binary and ASCII, but it probably makes little sense to others.
Binary
There are a number of different ways of representing numbers. When we read the number 230, we think "two hundred thirty". This is because we are most familiar with a number system that is based around the number 10. "230" is just a shorthand representation for 2*100 + 3*10 + 0. We say this number system (known as base-10) is based on the number 10 because 2*100 + 3*10 + 0 can also be written as 2*10^2 + 3*10^1 + 0*10^0, where ^ denotes an exponent; each digit in the number is multiplied by a power of 10. Can we write the same number as the sum of powers of 2 instead?
230 = 128 + 64 + 32 + 4 + 2
= 2^7 + 2^6 + 2^5 + 2^2 + 2^1
When we write 230 as a base-10 number, we keep only the digits in front of every power of 10: the two, three, and zero. What are the digits in front of each of our powers of two? The sum can be expanded to read:
1*2^7 + 1*2^6 + 1*2^5 + 0*2^4 + 0*2^3 + 1*2^2 + 1*2^1 + 0*2^0
If we keep only the digits in front of every power of two, this gives us 11100110. We call 11100110 the base-2 representation of 230. Since all digits in base-2 are one of only two values (zero and one), we give base-2 the name binary, which means "two-valued".
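The conversion above can be sketched in a few lines of Python. This is only an illustration: the function names to_binary and from_binary are made up for this example, and Python's built-in bin() does the same job in practice.

```python
def to_binary(n):
    """Return the base-2 digits of a non-negative integer as a string."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        # The remainder after dividing by 2 is the next digit,
        # starting from the lowest power of two.
        digits.append(str(n % 2))
        n //= 2
    # Digits were collected lowest power first, so reverse them.
    return "".join(reversed(digits))


def from_binary(bits):
    """Add up digit * 2^position for every digit in a base-2 string."""
    total = 0
    for position, bit in enumerate(reversed(bits)):
        total += int(bit) * 2 ** position
    return total


print(to_binary(230))           # 11100110
print(from_binary("11100110"))  # 230
```

from_binary mirrors the expanded sum shown earlier: each digit is multiplied by its power of two and the results are added together.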
So, what is so special about binary? There are only two values, and it is very easy for the physical components that make up a computer to represent two values:
- A switch can be turned on or off.
- A magnetic particle on a hard drive platter can have a positive or negative field orientation.
- A spot on a CD may contain a pit or not.
- An electrical wire can hold a voltage or none.
Each of these physical components has two states, which can be used to represent one and zero respectively.
ASCII
By itself, the fact that computers can easily represent numbers as a sequence of ones and zeros doesn't do a whole lot. But if we assign meaning to those numbers, we can start to do some interesting information processing. The obvious first thing to do is to encode text as numbers. To solve this problem, a group of people got together and decided to assign a number to every possible character they could think of. The result is known as the American Standard Code for Information Interchange, or ASCII. There are competing standards such as EBCDIC (Extended Binary-Coded Decimal Interchange Code), which is still used on many IBM servers, but ASCII is the dominant standard used today. A full table of the translations from characters to numbers can be found at www.asciitable.com.
In the ASCII table, we can see that the letter A was assigned the number 65. B was assigned 66, etc. The standard ASCII table assigns numbers from 0 to 127, and the extended tables in common use fill in the values up to 255. This means that all English language text can be written as a sequence of numbers where each number is less than or equal to 255. In binary, a number less than or equal to 255 can be represented with eight base-2 digits, or bits. Eight bits is called a byte, and bytes are the basis of how we count the size of memory and hard drive space. A kilobyte is 1024 (2^10) bytes. A megabyte is 1024 kilobytes, etc.
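To see these assignments for yourself, here is a small Python sketch. The helper names char_to_bits and text_to_codes are invented for this example; ord(), chr(), and format() are standard Python built-ins.

```python
def char_to_bits(ch):
    """Return the character's number written as eight binary digits."""
    code = ord(ch)              # look up the number assigned to the character
    return format(code, "08b")  # pad with leading zeros to a full byte


def text_to_codes(text):
    """Return the list of numbers that represents a piece of text."""
    return [ord(ch) for ch in text]


print(ord("A"))             # 65
print(char_to_bits("A"))    # 01000001
print(text_to_codes("AB"))  # [65, 66]
print(chr(66))              # B
```

Going the other way, chr() turns a number back into its character, so a sequence of bytes can be read back as text.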
Using a single byte per character works fine for the English language, but many languages have different alphabets. A good number of languages don't even use an alphabet, but instead have a very large number of characters. To support other languages, many computers now use what is known as Unicode. As originally designed, Unicode devoted two bytes to each character. With the sixteen bits in two bytes, it could represent 2^16, or 65,536, unique characters shared across all languages, which was expected to be more than enough. Unicode has since grown past that limit, and characters outside the first 65,536 take more than two bytes. Most applications within Windows XP can deal equally well with ASCII or Unicode text, and internally, Windows XP uses two-byte Unicode (UTF-16) for all text.
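The difference between one byte and two bytes per character can be checked with a short Python sketch. It assumes that the two-byte layout described above corresponds to the encoding Python calls "utf-16-be"; the helper name encoded_size is invented for this example.

```python
def encoded_size(text, encoding):
    """Return how many bytes the text occupies in the given encoding."""
    return len(text.encode(encoding))


# The letter A fits in one byte as ASCII and takes two bytes as UTF-16.
print(encoded_size("A", "ascii"))      # 1
print(encoded_size("A", "utf-16-be"))  # 2
print("A".encode("utf-16-be").hex())   # 0041

# A Chinese character (written here as the escape \u4e2d) has no ASCII
# number at all, but still fits in two bytes as UTF-16.
print(encoded_size("\u4e2d", "utf-16-be"))  # 2

# Sixteen bits give this many distinct values.
print(2 ** 16)  # 65536
```

Note that the two-byte form of A, 0041 in hexadecimal, is just the ASCII value 65 with an extra zero byte in front.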