The purpose of a programming language is to define a set of valid instructions for the computer. Programming languages are classified on a spectrum that runs from high-level languages to low-level languages. The terms high-level and low-level describe the level of abstraction of a language. Abstraction is an extremely important concept in programming. The phrase "more abstract" has several connotations, ranging from (1) more difficult to understand to (2) further removed from something concrete. Here we mean the second. The more abstract, or high-level, a language is, the further it is removed from the actual instructions executed by the concrete CPU of the computer. This matters because the instructions executed by the processor are optimized to be read by a machine, not by a person. For example, take the following lines of numbers:
10111000 00000101 00000000 00000000 00000000
10111010 00000011 00000000 00000000 00000000
11000001 11100000 00000010
10001101 00010100 01010010
00000011 11000010
Believe it or not, this is a very simple computer program. It is written in machine code (referred to in this unit as bytecode) for the x86 family of processors. These numbers are binary numbers, which means that they contain only 0's and 1's. Binary numbers are used because the actual physical devices in processors (and in memory and hard-disk storage, for that matter) are extremely small and thus susceptible to noise.
To understand this, imagine that you are in a room and someone asks you to tell them how bright a light is. Your options are 0% brightness, 25%, 50%, 75%, and 100%, and you have observed each level from 0% to 100%. Now someone turns a knob to one of those 5 levels at random and asks which level it is at. Chances are you could guess with reasonable accuracy. Now imagine that someone turns the knob to one of those levels and then jitters the knob so that the light flickers slightly. This is noise. The greater the noise, the harder it will be, on average, for you to guess which brightness level the light is actually at. Now consider the same situation, but this time the only question is whether the light is on or off. Someone turns the knob and jitters it, but this time it is much easier to distinguish a bright jittered light from a dim jittered light.
The situation is identical at the microscopic level inside a computer. It is easy to distinguish whether a component is at a strong or a weak electrical potential, even in the face of thermal noise. A binary system reduces the chances of errors in both the processor and in data storage. This is one reason why digital media (DVDs, MP3s, etc.) generally hold their quality better than analog media (video tapes and vinyl records).
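The light-and-knob analogy above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name nearest_level and the jitter amplitude of 0.2 are made up for this example, and the "observer" simply picks whichever allowed level is closest to the noisy reading.

```python
def nearest_level(reading, levels):
    """Return the allowed level that is closest to a noisy reading."""
    return min(levels, key=lambda level: abs(level - reading))

FIVE_LEVELS = [0.0, 0.25, 0.5, 0.75, 1.0]  # dimmer with five settings
TWO_LEVELS = [0.0, 1.0]                    # binary: off or on

JITTER = 0.2  # hypothetical noise amplitude, as a fraction of full brightness

# Five-level case: the knob is set to 50%, but jitter pushes the reading up.
five_level_guess = nearest_level(0.5 + JITTER, FIVE_LEVELS)

# Binary case: the light is fully on, and the same jitter pulls the reading down.
binary_guess = nearest_level(1.0 - JITTER, TWO_LEVELS)

print(five_level_guess)  # 0.75, so the observer picks the wrong level
print(binary_guess)      # 1.0, so the observer still reads "on"
```

With the same amount of jitter, the five-level reading is misjudged while the two-level reading is not, because the two binary levels are much farther apart than the noise can reach.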
In the simple program above, notice that the numbers are arranged in groups of 8. Each 0 or 1 is called a bit, and each set of 8 bits is a byte. A byte that represents an instruction is called a bytecode. In computing, units are generally powers of 2 (e.g., 2^3 = 8 bits in a byte). This means that the metric prefixes kilo, mega, and giga are conventionally taken as the nearest power of 2 (2^10 = 1024 rather than 1000). A kilobyte is 1024 bytes, a megabyte is 1024 kilobytes, and a gigabyte is 1024 megabytes. (As a side note, if you have ever wondered why the 250 GB drive you purchased shows up as having only 233 GB on your computer, it is because drive manufacturers count a gigabyte as 1000 x 1000 x 1000 bytes, while many operating systems count it as 1024 x 1024 x 1024 bytes: 250 x 1000^3 / 1024^3 is approximately 232.8.)
So the simple program above consists of 18 bytes of code. The bytes are broken into 5 lines because each line represents a distinct command in the machine language. In the early decades of computing, programmers literally flipped switches "on" or "off" to enter individual binary commands. As computers evolved, the need to abstract away from coding in binary arose and assembly language was born. We can rewrite the program above in x86 assembly language (if you own a PC or an Intel-based Mac, this is the language your processor understands):
mov eax,5
mov edx,3
shl eax,2
lea edx,[edx+edx*2]
add eax,edx
This is definitely more interpretable than the binary codes (although not by much). For example, "mov" means move and "add" means add. "eax" and "edx" are known as registers. You can think of these as temporary storage spots that are physically close to the processor. (Keep in mind that at 3 billion operations per second, the distance an electrical signal has to travel becomes very important. Thus, registers are the fastest form of storage, followed by memory, which is farther away, and then hard disk storage, which is both far away and often mechanical rather than purely electrical.) Notice that the language describes and commands physically defined components of the processor.
Also, each line of assembly language corresponds exactly to a line of machine language. For instance, the processor understands that 10111000 00000101 00000000 00000000 00000000 means "move the number 5 into the register eax". In other words, machine language and assembly language are 1 to 1: if you have one, you can translate directly into the other as long as you have a key that says which command corresponds to which machine bytecode, just like a simple cryptography cipher. Just to drill this point home, here is the code above written side by side, with the human-readable code on the left and the processor-readable code on the right:
mov eax,5 // 10111000 00000101 00000000 00000000 00000000
mov edx,3 // 10111010 00000011 00000000 00000000 00000000
shl eax,2 // 11000001 11100000 00000010
lea edx,[edx+edx*2] // 10001101 00010100 01010010
add eax,edx // 00000011 11000010
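The idea of a "key" between assembly and machine code, and the count of 18 bytes, can be checked with a small Python sketch. It is only a lookup table for the five lines in the listing above, not a real assembler; the names LISTING, to_bytes, assemble, and disassemble are hypothetical and chosen for this example.

```python
# Assembly line -> machine code, copied from the side-by-side listing above.
LISTING = [
    ("mov eax,5",           "10111000 00000101 00000000 00000000 00000000"),
    ("mov edx,3",           "10111010 00000011 00000000 00000000 00000000"),
    ("shl eax,2",           "11000001 11100000 00000010"),
    ("lea edx,[edx+edx*2]", "10001101 00010100 01010010"),
    ("add eax,edx",         "00000011 11000010"),
]

def to_bytes(bit_groups):
    """Convert space-separated groups of 8 bits into a bytes object."""
    return bytes(int(group, 2) for group in bit_groups.split())

# The "key": each assembly line maps to exactly one byte sequence, and back.
assemble = {asm: to_bytes(bits) for asm, bits in LISTING}
disassemble = {code: asm for asm, code in assemble.items()}

# Concatenate the five encoded lines into the complete program.
program = b"".join(assemble[asm] for asm, _ in LISTING)

print(len(program))                               # 18
print(" ".join(f"{byte:02x}" for byte in program))
# b8 05 00 00 00 ba 03 00 00 00 c1 e0 02 8d 14 52 03 c2
print(disassemble[bytes([0b00000011, 0b11000010])])  # add eax,edx
```

The second line of output shows the same bytes in hexadecimal, which is how machine code is usually written because it is more compact than binary.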
Of course, as computers continued to become more complex, more and more levels of abstraction were added. If we were to write the program above in the Object Pascal language it would look like the following:
X := 5;
Y := 3;
Z := 4*X+Y*3;
So now we see that our simple 18-byte program stores 5 and 3 and then does some arithmetic to obtain 4 times 5 plus 3 times 3 = 29. If you were to run the 5-line assembly program above, or instruct the processor directly to execute those 18 bytes, you would find that register eax contains the number 29.
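To see where the 29 comes from, the five assembly lines can be traced by hand or mimicked in Python. The sketch below is an assumption-laden stand-in rather than an x86 emulator: a dictionary named reg (a hypothetical name) plays the role of the registers, and each statement imitates one instruction.

```python
MASK32 = 0xFFFFFFFF  # registers such as eax and edx hold 32 bits

def run_five_line_program():
    """Mimic the five assembly lines; a dictionary stands in for the registers."""
    reg = {"eax": 0, "edx": 0}
    reg["eax"] = 5                                       # mov eax,5
    reg["edx"] = 3                                       # mov edx,3
    reg["eax"] = (reg["eax"] << 2) & MASK32              # shl eax,2 (5 * 4 = 20)
    reg["edx"] = (reg["edx"] + reg["edx"] * 2) & MASK32  # lea edx,[edx+edx*2] (3 * 3 = 9)
    reg["eax"] = (reg["eax"] + reg["edx"]) & MASK32      # add eax,edx (20 + 9)
    return reg

registers = run_five_line_program()
print(registers["eax"])  # 29

# The same arithmetic written the high-level way, as in Z := 4*X+Y*3.
x, y = 5, 3
z = 4 * x + y * 3
print(z)  # 29
```

Both routes arrive at the same value; the high-level version simply leaves the choice of registers and instructions to someone else.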
Note that in the Object Pascal version we have X, Y, and Z defined as the "storage" containers, known in modern programming as variables. X, Y, and Z are not physical components of the processor. Also, only two registers were used in the assembly language, yet there are three variables in the Pascal code. This is because a compiler for a language reads the source code and tries to determine the assembly language instructions that will result in the fastest code. This necessarily means that a high-level language is not 1 to 1 with low-level assembly language; the translation depends on the design of the compiler. For example, the following lines of assembly language will also result in a value of 29 in register eax.
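mov eax,5      ; eax = 5
mov edx,4      ; edx = 4
mul edx        ; eax = eax * edx = 20 (upper half of the product goes to edx)
mov ecx,eax    ; ecx = 20, saved for later
mov eax,3      ; eax = 3
mov edx,3      ; edx = 3
mul edx        ; eax = eax * edx = 9
add eax,ecx    ; eax = 9 + 20 = 29
mov eax,eax    ; copies eax onto itself, so the value is unchanged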
However, this set of instructions is slower than the one the compiler chose. For one thing, there are 9 instructions for the processor here instead of 5. Additionally, the multiply command is generally slower than the "shl" and "lea" commands chosen by the compiler, but the reasons for this are beyond the scope of this unit. What you should take away from this explanation is that a high-level language reduces the number of tasks left to the programmer by abstracting away from machine code. Thus, the purpose of a modern high-level programming language is to get the computer to do what the programmer wants with the easiest code possible. For all of the programming problems encountered in this unit you will be using only high-level constructs. Assembly will not be referred to again until Unit 3.
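As a brief aside on the "shl" and "lea" substitutions mentioned above, the Python sketch below checks that shifting left by two bits gives the same result as multiplying by 4, and that adding a value to twice itself gives the same result as multiplying by 3. The function names are hypothetical, and the sketch ignores the 32-bit limit of real registers; it only confirms the arithmetic, not the speed difference.

```python
def times_four_with_shift(x):
    """Multiply by 4 the way 'shl eax,2' does: shift left by two bits."""
    return x << 2

def times_three_with_lea(y):
    """Multiply by 3 the way 'lea edx,[edx+edx*2]' does: y plus y scaled by 2."""
    return y + y * 2

# Both substitutions agree with ordinary multiplication for every value tried here.
shift_matches = all(times_four_with_shift(n) == 4 * n for n in range(1000))
lea_matches = all(times_three_with_lea(n) == 3 * n for n in range(1000))

result = times_four_with_shift(5) + times_three_with_lea(3)

print(shift_matches, lea_matches)  # True True
print(result)                      # 29
```

This is the kind of rewriting a compiler can do on the programmer's behalf, which is why the high-level source never needs to mention shifts at all.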