The most basic element of source code is a token. Tokens are pieces of code that cannot be subdivided. In Object Pascal, all tokens can be classified as reserved words, numbers, identifiers, symbols, characters, or character strings. Comments, whitespace, and compiler directives are not tokens.
Exercise 2-1. How many tokens are in the source code below?
{$APPTYPE CONSOLE}
program TokenTest;
begin
Writeln('How many tokens?');
Readln;
end.
There are 13 tokens. Here they are with types listed:
program // Reserved word
TokenTest // Identifier
; // Symbol
begin // Reserved word
Writeln // Identifier
( // Symbol
'How many tokens?' // Character string
) // Symbol
; // Symbol
Readln // Identifier
; // Symbol
end // Reserved word
. // Symbol
Breaking source code into tokens is known as tokenization and it is the very first step a compiler performs when analyzing code. Tokenization is also used by syntax highlighters when choosing the color and style for displaying source code. Syntax highlighters are extremely useful because they allow us to recognize groups of symbols/characters as distinct tokens. For example, consider the same code without highlighting:
program TokenTest;
begin
Writeln('How many tokens?');
Readln;
end.
It is much more difficult to see that the string 'How many tokens?' is a single token rather than several, especially if you did not know that a single quote (') starts a string.
Let's examine each class of token. Reserved words and identifiers look very similar, but in a syntax-highlighting editor you can always identify a reserved word by its color. Outside of such an editor, you simply have to know the reserved words from memory (there are around 80 in Object Pascal).
Numbers are anything consisting of the digits from 0 to 9, decimal points, or exponential specifiers. Digits and decimal points are self-explanatory (e.g., 12 and 12.5 are valid numbers). The exponential specifier is used for something known as scientific notation. Large numbers such as 1 trillion are bulky to write, so scientific notation simplifies the process by writing the number as a product of some number and a power of 10. So, the number 1 trillion can be written in scientific notation as follows, where 12 is the number of zeros in the number:
$$1{,}000{,}000{,}000{,}000 = 1 \times 10^{12}$$
In Object Pascal, we specify 1 trillion as 1e12. We can also specify decimal numbers such as 1 millimeter (one thousandth of a meter) in this notation:
$$1\ \text{mm} = 1 \times 10^{-3}\ \text{m}$$
In Object Pascal we would write 1e-3.
In order to experiment with this, open the file Workspace in Unit1/Workspace. We will use this template to experiment with various bits of code throughout this chapter, so don't worry about overwriting the code inside it each time we do a new exercise. Enter the following code into the editor:
{$APPTYPE CONSOLE}
program Workspace;
begin
Writeln(1e12);
Writeln(1e-3);
Readln;
end.
If all goes well, you should see two lines similar to 1.0000000000E+0012 and 1.0000000000E-0003 printed to the screen (the exact number of digits shown depends on your compiler). You will also notice that the editor highlighted each of 1e12 and 1e-3 in its entirety as a number. This is because, in Object Pascal, 1e12 and 1e-3 are each interpreted as a single token (a number). Now try adding the two together, and let's put an additional + sign in 1e+12:
Writeln(1e+12+1e-3);
Notice now that the first + is colored as a number but the second is not. (Note: If you are using Lazarus, there was a bug in the syntax highlighter at the time of this writing which miscolors the + and - signs. If you are using Delphi, the syntax is colored correctly. Ignore the editor for now and pay attention to the figure above.) The first + is recognized as part of the numeric token 1e+12 and the second + is recognized as a binary operator. Thus, the same symbol can have different meanings depending on the tokens that have preceded it (and often, in modern languages, also depending on the token that immediately follows it). The way the compiler achieves this is actually somewhat complex, but fortunately you will get used to it very quickly with a bit of coding experience.
Numbers are classified as either integers or reals. In mathematics, integers are numbers without decimal points, and reals consist of all integers plus all numbers with decimal points. In Object Pascal, the definition of integers and reals is similar, except that numbers with exponential specifiers, like 1e12, are automatically considered reals.
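The distinction between integer and real tokens can be tried out in the Workspace template. The following is only a minimal sketch: the program name NumberTokens is made up for illustration, and the :0:1 style suffixes are Writeln formatting options (minimum width and number of decimal places) that have not been introduced in this chapter. They are used here only so that the reals print in ordinary decimal form rather than in scientific notation.

```pascal
{$APPTYPE CONSOLE}
program NumberTokens; // hypothetical name, used only for this sketch
begin
  Writeln(12);        // integer token: prints 12
  Writeln(12.5:0:1);  // real token: prints 12.5
  Writeln(1e12:0:0);  // real token (exponential specifier): prints 1000000000000
  Writeln(1e-3:0:3);  // real token: prints 0.001
  Readln;
end.
```

Even though 1e12 has no fractional part, it is still a real token, which is why Writeln(1e12) without the formatting suffix prints it in scientific notation.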
Exercise 2-2. Identify each of the following as integer, real, or not a valid number in Object Pascal.
This is a real number in Object Pascal.
Though this is a real number in mathematics, Object Pascal requires at least one digit to precede a decimal point for the token to be a valid number.
This is a real number in Object Pascal. The + sign between the e and 4 is optional.
This is an integer in Object Pascal.
The 3 is an integer in Object Pascal, but the - is actually a unary operator, which means -3 consists of 2 tokens. This can easily be seen by observing the syntax highlighting of -3 in the editor.
Identifiers in Object Pascal must begin with a letter or underscore (_) and then can be followed by any combination of letters, digits, or underscores as long as that combination is not a reserved word. Single identifiers cannot be broken by symbols.
Exercise 2-3. Which of the following are valid identifiers in Object Pascal?
This is a valid identifier in Object Pascal.
This is a reserved word in Object Pascal and therefore cannot be used as an identifier.
Identifiers cannot start with digits.
This is a valid identifier in Object Pascal.
This is a valid identifier in Object Pascal.
Identifiers cannot be split with symbols. rec and field are two separate identifiers.
Identifiers cannot be split by whitespace. My and Function are two separate identifiers.
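One way to check the identifier rules is to let the compiler decide. The sketch below is an assumption-laden illustration, not part of the exercise: the names Total, _hidden, and Value2 are invented, and the var block and the := assignment are constructs that belong to later sections. The commented-out lines show identifiers that would be rejected if you removed the leading //.

```pascal
{$APPTYPE CONSOLE}
program IdentifierTest; // hypothetical name, used only for this sketch
var
  Total: Integer;       // letters only: valid
  _hidden: Integer;     // begins with an underscore: valid
  Value2: Integer;      // digits are allowed after the first character: valid
  // 2Value: Integer;   // invalid: begins with a digit
  // begin: Integer;    // invalid: begin is a reserved word
begin
  Total := 1;
  _hidden := 2;
  Value2 := 3;
  Writeln(Total + _hidden + Value2); // prints 6
  Readln;
end.
```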
The final token types to discuss are the character and the character string. A character in Object Pascal is a single-quoted symbol such as ';'. Because it is surrounded by single quotes, this is the semicolon character rather than the semicolon symbol that is used to separate statements. Try the following in the editor:
Writeln(';');
Writeln(;);
Readln;
You will note that this code fails to compile. If you remove the Writeln(;) statement, the code will compile successfully and display a semicolon on the screen. The reason for the distinction between a character and a symbol is that the compiler has to know whether you are issuing it a command with a symbol or asking it to take in text verbatim. A character string is simply a series of characters delimited by single quotes, such as 'Hello World!'. If you want to actually output a single quote character, you have to repeat it (e.g., 'Object Pascal''s Syntax'). This is known as escaping, and it tells the compiler that you mean for the single quote to act as a character and not as a string delimiter. You should know that almost every language has a slightly different mechanism for escaping characters, so do not expect the double apostrophe in 'Object Pascal''s Syntax' to work in PHP or Python, for example.
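The doubled-quote rule can be tried out with a short program. This is a minimal sketch; the program name QuoteTest is made up for illustration.

```pascal
{$APPTYPE CONSOLE}
program QuoteTest; // hypothetical name, used only for this sketch
begin
  Writeln('Object Pascal''s Syntax'); // prints: Object Pascal's Syntax
  Writeln('''');                      // a single escaped quote; prints: '
  Writeln('''quoted''');              // prints: 'quoted'
  Readln;
end.
```

In the second statement, the outer pair of quotes delimits the token and the inner pair is the escaped quote character, so four quote marks in the source produce one on the screen.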
Exercise 2-4. What will be the screen output of the following statements?
Writeln('Object Pascal');
This will print Object Pascal to the screen.
Writeln('Object Pascal's Syntax');
This will not compile because the single quote after Pascal will terminate the string. This means s and Syntax will be interpreted as identifiers. Note the syntax highlighting:
Writeln('Object Pascal's Syntax');
This will not compile because the string has not been terminated by a second single quote.
Tokens combine to form more complex constructs such as statements, expressions, and declarations as will be discussed in the following sections.