(This post is part of the programming languages 4th edition series.)
Programming Language Translation Issues
These notes are based on Programming Languages: Design and Implementation.
Section 3.1: Programming Language Syntax
define syntax:
the arrangement of words as elements in a sentence to show their relationship
In programming terms, syntax describes the sequence of symbols that make up valid programs.
- Syntax alone is not sufficient for preventing ambiguity
x=2.45 + 3.67
can mean different things based upon the type of x and whether+
is integer addition or real addition
define semantics:
Non-syntax attributes of a programming language. Includes declarations, operations, sequence control, etc
Section 3.1.1: General syntactic criteria
Readability
- A program is readable if data/algorithm is apparent by inspection of program text
- A readable program is said to be self-documenting
- The following features enhance program readability:
- natural statement formats
- structured statements
- liberal use of keywords and noise words
- embedded comments
- unrestricted length identifiers
- mnemonic operator symbols
- free-field formats
- complete data declarations
- Readability of a program is not guaranteed by language design
- Bad programmers make unreadable code, no matter the language!
- Bad syntax design can make it very hard for good programmers to write
readable code
- APL
-
COBOL was designed to emphasize readability
- writing and translation at the expense
-
It is good when syntactic differences reflect semantic differences
- Program constructs that do similar things should look similar
- Program constructs that do different things should look different
- Languages that provide only a few syntactic constructs lead to less readable
programs
- APL
- SNOBOL4
- Some syntax errors may alter the meaning of a statement without being
syntactically incorrect.
- Mismatching parentheses in LISP
Writability
- Syntax features that make programs easy to write usually make them hard to
read
- Writability enhanced by concise and few syntax structures
- Readability enhanced by verbose and varied syntax structures
- Implicit syntax (allowing declarations/operations to be unspecified) make programs shorter and easier to write but harder to read.
- A syntax is redundant if it can say the same thing in more than one way
- Sometimes this is useful
- Sometimes this makes programs harder to write
- Languages that have no redundancy are often difficult to use
Ease of verifiability
- Program correctness
- Also known as program verification
- Understanding programming statements is easy, but producing correct programs is hard.
- Some programs can be mathematically proven correct
- This will be discussed later
Ease of translation
- Another goal is making programs easy to translate into executable form
- Readability/writability directed at programmer needs, ease of translation
directed at the program translator.
- Lisp is not very readable or writable but is very easy to translate
- COBOL is hard to translate due to number of statement forms allowed
Lack of ambiguity
- Ambiguity is a central problem in every language design
- A language ideally provides a unique meaning for every syntax construct
- Problems of ambiguity usually arise in interplay between different structures
Section 3.1.2: Syntactic elements of a language
Character set
- Character set choice one of the first choices to be made in syntax design
- Due to various written languages, 16-bit representations must be considered
Identifiers
- widely accepted syntax:string of letters and digits that start with a letter
- Variations between languages include
- inclusion of special characters (such as . and _)
- length restrictions
Operator symbols
- Most languages use + and - to represent the two basic arithmetic operations
Keywords and reserved words
define keyword:
an identifier used as a fixed part of the syntax of a statement ("if" beginning a conditional or "for" beginning an iteration)
define reserved word:
a keyword that may not be used as a programmer-chosen identifier
- Most languages used reserved words
- Most languages begin with a keyword designating the statement type
- IF
- WHILE
- FORTRAN is difficult due to "DO" or "IF" may not necessarily indicate
iteration or conditional statement
- DO and IF are not reserved words. They can be variable names.
- Addition of a new reserved word to a program can break previous programs that used the new word as a variable name.
Noise words
define noise word:
Noise words are optional words that are inserted into statements to improve readability.
- Example: COBOL "GO TO"
- GO required, TO optional
Comments
- Comments are important part of program documentation
- Several comment types
- Separate comment lines
- Delimited by special markers (/ / in C)
- Beginning anywhere and terminated by end of line (// in C++)
Blanks (spaces)
- Rules on use of blanks vary between languages
- In C, blanks are not significant except in string literals
- Some languages use blanks as separators, so they are important
- SNOBOL4, concatenation represented by a blank
Delimiters and brackets
define delimiter:
A syntax element used to mark the beginning or end of some syntax unit such as a statement or expression.
- Brackets are paired delimiters
- parenthesis
- begin...end pairs
- Usually serve the purpose of removing ambiguities by defining explicit boundaries
Free and fixed-field formats
define free-field:
A syntax is free-field if program statements may be wr itten anywhere on an input line without regard for positioning on the line.
define fixed-field:
A syntax is fix field if the positioning on an input line conveys information.
- Fixed field syntaxes are a remnant of the punch-card era
- Fixed-field syntax is increasingly rare today
- free field is the norm
Expressions
define expression:
An expression is a function that accesses data objects in a program and returns some value
- The basic syntactic building block from which statements are built
- In imperative languages, expressions form the operations that allow for the machine state to be changed by each statement.
- In applicative(functional) languages, expressions form the basic sequence control that drives execution.
Statements
- The most prominent syntactic component in imperative languages
- Statement syntax has critical effect on
- regularity
- readability
- writability
- Most programming language provide different syntax structures for different statement types
- Different types of statements:
- A simple statement contains no other statements
- A structured(AKA nested) statement may contain embedded statements
Section 3.1.3: Overall program-subprogram structure
Separate subprogram definitions
- C has each subprogram definition treated as a separate syntax unit
- Each program compiled separately
- Compiled programs linked at load time
- Object orientation requires information to be passed among the separately
compiled units
- Inheritance requires the compiler to process some of the subprogram issues
Separate data definitions
- Group together all operations that manipulate a data object
- Subprogram may consist of all operations that create/print data object
- The general approach of the class mechanism if Java, C++, Smalltalk
Nested subprogram definitions
- Important for building modular programs during period of ALGOL, FORTRAN, and Pascal.
- In Pascal, subprogram definitions appeared within main
- May contain other subprograms nested within their definitions
- This type of subprogram allows for static type checking
Separate interface definitions
- FORTRAN permits easy compilation of subprograms, but data used across
subprograms may have different definitions
- compiler will not be able to detect this at compile time
- Pascal allows compiler to have access to all these definitions to aid in
finding errors
- However, the entire program must be recompiled at the slightest change
- C, ML, and Ada use aspects of both techniques to improve compilation
define implementation:
a program implementations consists of several subprograms that are intended to interact together. The components, called modules, are linked together to create an executable, but only any changed components need to be recompiled.
- With implementations, data passed among procedures of a component must have
common declarations like in Pascal.
- Leads to efficient compiler checking
- To pass information between too separately compiled components, additional
data is needed.
- C header files form the specification component, .c files the "implementation component"
Data descriptions separated from executable statements
- COBOL contains early form of component structure
- Data declarations/statements divided into data and procedure "divisions"
- The environment division consists of declarations involving the external operating environment.
Unseparated subprogram definitions
- SNOBOL4 represents extreme in program organization
- No syntactic distinction made between main program and subprogram statements
- In SNOBOL4, statements simply execute
- Where subprograms begin and end are not different in syntax
- Execution of a function starts a new subprogram
- RETURN function ends a subprogram
- Very chaotic! Only useful in allowing run-time translation
Section 3.2: Stages in translation
- Translation of a program from syntax to executable is central in every implementation
- Translation can be simple
- Perl
- Prolog
- Lisp
- Translation is usually complex
- Translation is trivial if you're willing to accept slow execution speed
- Execution speed is a big deal, that's why we go through the effort of translating programs into efficient executables
Translation can be divided into two major parts: analysis of the source and synthesis of the executable program object. These processes alternate, often on a statement-by-statement basis.
Translators can be grouped by the number of passes they make over the source:
- Simple compiler usually uses two passes
- First analysis pass decomposes program. Gets info from program
- Second pass typically generates an object from the collected information
- If compilation speed is important, one pass may be used
- As program is analyzed, it is instantly converted to object code
- If execution speed is important, a three-or-more pass compiler may be used
- First pass analyzes source program
- Second pass rewrites the source program into a more efficient form via well defined optimizations
- The third pass generates the object code
- The relationship between number of passes and compiler speed is no longer a
big factor.
- The big factor now is language complexity
Section 3.2.1: Analysis of the source program
To a translator, the source appears at first as one long sequence of symbols. The subprograms, statements, etc, visible to the programmer are not apparent to the translator yet.
Lexical analysis (scanning)
- The initial phase of any translation is to group the sequence of characters
into its parts:
- identifiers
- delimiters
- operator symbols
- numbers
- keywords
- etc
- This phase is called lexical analysis, and the program units produced are called lexical items or tokens.
- Individual lexical items are called lexemes
- Lexical items convey a single meaning
The lexical analyzer (scanner) reads lines of the input source, breaks down the lines into lexemes, and passes the lexemes to the later stages of the translator.
The lexical analyzer must:
- identify the type of each lexeme
- number
- identifier
- delimiter
- etc
- attach a type tag to the lexeme
- convert items to internal representations
Lexical analysis usually requires a majority of translation time.
Syntactic analysis (parsing)
- The second stage of translation
- Larger program structures are identified using lexical items
- statements
- declarations
- expressions
- etc
Syntactic analysis usually alternates with semantic analysis
- Syntactic analyzer groups a sequence of tokens into a syntactic unit such as
- statement
- expression
- declaration
- Semantic analyzer processes this unit.
- Syntactic and semantic analyzers communicate via stack
Semantic analysis
- The central phase of translation
- Syntactic structures recognized by the syntactic analyzer are processed
- The bridge between analysis and synthesis stages of translation
- Many subsidiary functions occur in this stage
- symbol-table maintenance
- most error detection
- expansion of macros
- execution of compile-time statements
- The output is usually some internal form of the final executable
- This executable is then manipulated by the optimization stage
- The semantic analyzer is usually split into a set of smaller semantic
analyzers.
- Each small semantic analyzer will handle a particular type of program construct.
- These small analyzers interact with each other, usually through the central symbol table.
The functions of the semantic analyzers depend on the programming language and the organization of the translator. The most common functions are:
Symbol-table maintenance
define symbol table:
One of the central data structures in every translator. It contains an entry for each different identifier encountered in the source program.
- The lexical analyzer makes the initial entries
- The symbol-table entry contains more than just the identifier:
- type (simple variable, array name, parameter, etc)
- type of value (integer, float)
- referencing environment
- Semantic analyzers enter information into the table as they process:
- declarations
- subprogram headers
- program statements
- Other parts of the translator use this information to make efficient code
The symbol table for compiled languages is usually discarded at the end of translation.
- It may be retained
- Useful for debugging
- Used by languages that allow new identifiers to be created at run-time
Insertion of implicit information
- Information implicit within the source must be made explicit in the lower
level object program.
- Type guessing
- FORTRAN: var used without declaration can be type guessed based on first letter
Error detection
- Syntactic and semantic analyzers must be able to handle incorrect programs.
- The lexical analyzer may send the syntactic analyzer a token that does not fit its surrounding context
- The semantic analyzer must recognize errors, produce an error message, and continue syntactic analysis of the rest of the program.
Macro processing and compile-time operations
-
Not all languages include macro features.
- For languages that do, they are generally handled in semantic analysis
-
A macro is a piece of program text that has been separately defined that will be inserted into the program during translation when the macro call is encountered
- Like a subprogram, except the macro body is substituted for each call during translation
- Sometimes macros are just simple strings, like the constant PI to be 3.14
- Sometimes they look like subprograms with parameters
-
Semantic analyzers must identify the macro call and perform the substitution
- This usually involves interrupting lexical and syntactic analyzers and directing them to work on the macro body.
- If the macro body is already partially translated, the semantic analyzer can process it directly, inserting the object code and making symbol table entries.
-
A compile-time operation is an operation intended to be performed during translation with the purpose of controlling the translation of the source
- C "#define" lets constants/expressions be evaluated before compilation
- C "#ifdef" (if defined) allows for different code to be compiled based on the presence/absence of certain variables
An example C source file that can compile to alternate versions
# define pc /* Set to pc or unix */ ... DoStuff() #ifdef pc /* do pc version */ #else /* do unix version */ #endif
Section 3.2.2: Synthesis of the object program
The final translation stages deal with constructing the executable program from the "internal executable" produced by the semantic analyzer. Code generation and optimization will occur here. If subprograms are translated separately, or if libraries are used, linking and loading will be needed to produce the executable.
Optimization
- Code generators generate object code from the semantic analyzer's produced intermediate code.
- Before generation, optimization of the intermediate code occurs.
- The semantic analyzer usually does not worry about surrounding code
- This means the initial intermediate code can be very inefficient
Example: The statement A=B+C+D
The semantic analyzer may generate the following intermediate code
Temp1 = B+C Temp2 = Temp1 + D A = Temp2
The code generators may generate the following inefficient code
load register with B add C to register store register in Temp1 load register with Temp1 add D to register store register in Temp2 load register with Temp2 store register in A
The register storage and loading is redundant since all data can be kept in the register before storing the result in A.
Let the semantic analyzers produce poor code, and clean it up during optimization.
Code generation
- After intermediate code has been optimized, it must be formed into one of
- assembly language
- machine code
- other object form
- Code generation involves formatting the output from the information contained in the internal program representation (intermediate code)
- Output code may be directly executable, or may require linking/loading
Linking and loading
- The final stage of translation
- Output code from separate subprogram translations are placed into final executable
- Up until this point, executable programs are in almost final form
- except where the programs reference external data or subprograms
- The incomplete locations are specified in loader tables created by the translator
- The linking loader uses the loader tables to link separate translation code together.
- The result is the final executable, ready to be run
Bootstrapping
- Often the translator for a new language is written in that language
- Initial Pascal compiler was written in Pascal
- How does this process get started?
- How does the first compiler get compiled?
- The process is called bootstrapping
- One way to bootstrap is to write the compiler by hand
- The process is tedious but not difficult
- Now the hand translated compiler can be executed
- A compiler written in the language itself can now be made using this hand translated compiler
Diagnostic compilers
- Rapid turnaround and compilation time were the major goals
- Student programs did not execute very long
- Student programs were discarded after being executed correctly
Section 3.3: Formal translation models
define grammar:
The formal definition of the syntax of a programming language
- grammar consists of a set of rules (called productions)
- the rules specify the sequence of characters (lexical items) that form allowable programs in the language being defined.
A formal grammar is just a grammar specified using strictly defined notation.
- The two classes of useful grammars in compiler technology:
- The BNF grammar (context-free grammar)
- regular grammar
Section 3.3.1: BNF Grammars
A simple English sentence is often of the form subject
/ verb
/ object
.
Example sentences:
The cat
/ran
/home
The monster
/scared
/me
You can break down the examples above further as article
/ noun
/ verb
/
object
.
There are many other types of sentences. An interrogative (question) sentence
may look like auxiliary verb
/ subject
/ predicate
such as
Did
/the cat
/run home?
Is
/the monster
/scaring you?
So with a very simple persecutive, we could say a sentence may be a declarative sentence or a question. Notationally, we would say
<sentence> ::= <declarative> | <interrogative>
The symbols ::=
stands for "is defined as" and the |
symbol stands for "or"
We can then further define <declarative>
and <interrogative>
:
<declarative> ::= <subject> <verb> <object> <subject> ::= <article> <noun> <interrogative> ::= <auxiliary verb> <subject> <predicate>?
This notation is called BNF, Backus-Naur form. For our purposes, you can consider BNF grammar and Context-free grammar as equivalent terms.
Syntax
- A BNF grammar is composed of a finite set of BNF grammar rules
- These rules define a language - in our case a programming language
- Syntax is only concerned with form, not meaning.
- A programming language consists of a set of syntactically correct programs
- These programs are a sequence of characters
- A syntactically correct program may not make any sense semantically
For example, The home
/ ran
/ girl
fits the syntax of a simple declarative
sentence, but it doesn't make any sense.
-
The only restrictions an what we call a language:
- Each string must be of finite length
- Each string must contain characters from some fixed alphabet of symbols
-
Examples of languages
- The set of all C assignment statements
- The set of all C programs
- The set of sequences of a's and b's where all a's precede the b's
- ab
- aab
- abb
- Using English to convey language definitions can make things confusing and unclear.
- We solve the problem by using a formal set of rules for determining exactly what strings are in the language.
- In the simple case, a grammar may list the elements of a finite language
For example,
<digit> ::= 0|1|2|3|4|5|6|7|8|9
- The "digit" term is called a syntactic category or a nonterminal.
- It's the name of the language defined by the grammar rule.
-
The symbols making up the strings of the language are called terminal symbols
-
Sometimes an arrow is used instead of
::=
.<X> ::= <B>|<C>
can be written as X → B|C
Once we've defined a basic set of nonterminals, we can use them to construct more complex strings. Consider the following rule that defines the language of conditional statements
- "Boolean expression and "statement" are syntactic categories (non-terminals)
Syntactic categories as they are defined may refer to themselves in the rule. Such a rule is called a recursive rule. For example,
-
This rule defines an "unsigned int" as a sequence of digits.
- The first option is a single digit
- The second option allows another digit to be added on to the initial digit, and so on.
-
A complete BNF grammar is a set of grammar rules
- Together, the rules define a hierarchy of sub-languages
- This hierarchy leads to the top-level syntactic category
- For a programming language, this top-level is usually called
<program>
Here's an example of a grammar defining an assignment statement, supposing the
nonterminals <identifier>
and <number>
are already defined:
Here are some example strings in the language
And here are some example strings NOT in the language
Parse Trees
If we're given a grammar, we can use a single-replacement rule to generate strings in the language. For example, the following grammar generates all sequences of balanced parenthesis:
S → SS | (S) | ()
Starting with just S, how can we generate the string "(()())"?
- S → (S), yielding (S)
- Since S → SS, we can substitute into (S) to get (SS)
- Since S → (), we can substitute into (SS) to get (()S)
- Since S → (), we can substitute into (()S) to get (()())
We can show this in a more concise manner, using the symbol ⇒ to indicate that one string is derived from another string.
S ⇒ (S) ⇒ (SS) ⇒ (()S) ⇒ (()())
- Each term in the derivation is called a sentential form.
-
We formally define a language as the set of sentential forms consisting entirely of terminal symbols.
-
To determine if a string is a valid program in the language defined by a BNF grammar
- Use the grammar rules to parse the string
- If the string can be parsed, it is in the language
- If there is no way of parsing the string with the rules, it is not in the language.
- Parse trees result from syntactic analysis
To see if W=Yx(U+V)
is in the language of assignment statements defined
earlier, we attempt to construct a parse tree:
- The BNF grammar rules cause the parse tree to in fact be a tree.
- The leaf is a single character / lexical item in the input string
- The intermediate branches are syntactic categories to which the subtree below it belongs.
The same language may be defined by many different grammars.
- BNF grammars cannot define areas involving contextual dependence:
- Preventing multiple definitions of the same identifier
- Preventing a 2d array from being referenced with 3 subscripts
Ambiguity
Avoiding ambiguity is an important problem with syntax. Consider the sentence "They are flying planes". This sentence can be represented in a couple of ways:
They
/are
/flying planes
They
/are flying
/planes
These statements can mean different things although they come from the same string.
- Ambiguity is often a property of a grammar
- NOT of a language
Consider a grammar G, which generates all binary strings:
G: S → SS | 0 | 1
The grammar G is ambiguous. How do we know? We can construct two distinct parse trees for the string "001"
If every grammar for a language is ambiguous, the language is called inherently ambiguous.
The language of all binary strings is not inherently ambiguous, though. The following grammar generates binary strings unambiguously
G2: T → OT | 1T | 0 | 1
And here's the parse tree for the string "001"
Extensions to BNF Notation
- BNF grammars are not ideal for explaining the rules of a language syntax for
programmers.
- The BNF rule simplicity forces unnatural representation for
- optional elements
- alternative elements
- repeated elements
- Simple syntactic constructs become complex to notate
- The BNF rule simplicity forces unnatural representation for
For example, defining the language of signed integers: digits preceded by a + or - sign
<signed integer> ::= +<integer> | -<integer> <integer> ::= <digit> | <integer><digit>
Extensions to BNF simplify this issue.
Extended BNF notation
These extensions do not alter the power of BNF in anyway, they just make the description of the languages easier
- Optional elements can be enclosed is square brackets []
- Alternative elements can be distinguished with the | symbol
- An arbitrary sequence of element instances can be specified with {...}*
- The sequence can be empty
An example using this extended BNF notation:
<identifier> ::= <letter>{<letter>|<digit>}*
As another example, consider our assignment statement BNF, rewritten to use extended BNF notation
Section 3.3.2: Finite-State Automata
- Lexical analysis breaks source into a stream of tokens
- A state machine can recognize tokens such as
- Identifiers
- Digits
- An if statement
An example Finite-State Automata (FSA) that recognized an odd number of 1s
- Denote the starting state with an arrow that does not come from another state
- The final state is indicated with a double circle
Example application of the FSA above:
Input | Ending state | Accept the string? |
---|---|---|
null | A | no |
100 | B | yes |
10010 | A | no |
100101 | B | yes |
Characteristics of an FSA
- Has a starting state
- One or more final states
- A set of transitions (labeled arcs) from one state to another
- Any string that takes the machine from the initial to the final state is "accepted" by the machine.
Section 3.4: Recursive descent parsing
- We can rewrite extended BNF rules as procedures
Consider the rule for arithmetic expressions discussed earlier:
<arithmetic expression> ::= <term> {[+|-]<term>}*
If we assume that
- The variable
nextchar
always contains the first character of the respective nonterminal getchar()
reads in a character
Then we can rewrite the BNF grammar as the following recursive procedure:
procedure Expression; begin Term; /* Call procedure Term to find first term */ while ((nextchar='+') or(nextchar='-')) do begin nextchar := getchar; /* Skip operator */ Term; end end