Programming Language Principles
Table of Contents
- 1. Intro
- 2. Lexical Analysis
- 3. Parsing
- 4. Binding Times
- 5. Semantic Analysis
- 6. Control Flow
- 7. Type Systems
- 7.1. Meaning of Type
- 7.2. Polymorphism
- 7.3. Orthogonality
- 7.4. Enumeration
- 7.5. Sub-range type
- 7.6. Composite Types
- 7.7. Type Checking
- 7.8. Type Equivalence
- 7.9. Type Conversion and Cast
- 7.10. Universal Reference type
- 7.11. Type inference
- 7.12. Parametric Polymorphism
- 7.13. Equality Testing and assignment
- 8. Composite Types
- 9. Subroutine and Control Abstraction
- 10. TODO Scott Supplement
- 11. λ Calculus
- 12. JVM Overview
- 13. Denotational Semantics
- 14. Axiomatic Semantics
- 15. OOP
1 Intro
1.1 Specification of programming lang:
- Syntax
- Semantics
1.2 Spectrum
- Imperative
- Declarative:
- Functional
- Logic
- query
- Concurrent
1.3 Compilation process
Source code → incomplete machine code → complete machine code (in the last step, the linker inserts system libraries)
1.4 JVM
- has support for OOP, exception, threads
1.5 Bootstrapping
Suppose we want to make a compiler for a new language L.
- Assuming we have a C compiler on our target machine, write a compiler for L using a simple subset of C.
- Once this compiler is complete, translate the compiler from C to L.
- Use the compiler from step 1 to compile the compiler written in L.
- From now on, use the compiler produced in step 3.
Start with smaller subset of L to be implemented using C, then use L compiler to implement other L features.
1.6 Preprocessor
- Trivial manipulation of text using simple pattern matching.
- Usually no error messages
1.7 Turing Complete
A programming language is said to be Turing complete if it can be shown that
it is computationally equivalent to a Turing machine.
- In general, for an imperative language to be Turing complete, it needs:
  - A form of conditional repetition or conditional jump (while, or if + goto)
  - A way to read and write some form of storage
- For a λ calculus based function language to be Turing complete, it needs:
- Ability to abstract over arguments (Eg: lambda abstraction, quotation)
- Ability to apply functions to arguments (Eg: reduction)
2 Lexical Analysis
2.1 Syntax v/s semantics
- Syntax: Structure of the language
Semantics: meaning of the constructs
// here the syntax is correct but it is semantically incorrect code:
// note the return type of `funFoo` and the predicate; comparing an int
// and a string is meaningless.
int funFoo(int c) {
    if (c > "string") {
        return "type mismatch";
    } else {
        return 2;
    }
}
2.2 Regular Expression
2.2.1 ε
Empty String ""
2.2.2 Atom:
Eg: Σ = {a, b, c}
2.2.3 Concatenation
- a ε = {a}
- ab = {ab}
2.2.4 Alternative: |
- a | b
- (a | b) c = {ac, bc}
2.2.5 Kleene Closure: *
- a* = {ε, a, aa, aaa, …}
- a(b|c)*a = {aa, aba, aca…}
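These closure examples can be checked directly with Java's built-in regex engine (a quick sketch; the class and method names are illustrative):

```java
import java.util.regex.Pattern;

// Checks strings against the Kleene-closure example a(b|c)*a:
// an 'a', then any mix of b's and c's (possibly none), then another 'a'.
public class KleeneDemo {
    public static boolean matches(String s) {
        return Pattern.matches("a(b|c)*a", s);
    }
}
```

For example, `matches("aa")` and `matches("abcba")` hold, while `matches("ada")` does not, since `d` is not in the alternative.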
2.2.6 Derived notations:
- a+ = a a* (one or more)
- a? = (a | ε) (optional)
2.2.7 Restrictions
- Nothing can be defined recursively.
- The extension to RE can be shown to be equivalent to the basic operations just by substitution.
- disadvantage: recursive definitions cannot be expressed
- nested constructs cannot be expressed. Eg: blocks, nested comments like /* … /* … */ … */
- since nesting cannot be expressed using RE, nested comments are not allowed in Java/C.
- comments are removed during lexical analysis
2.3 Scanners
- tokenize using RE
- Eg: 01 will be tokenized as two digit tokens with nothing between them
- it will be marked as error by parser
- responsible for:
- tokenizing source
- removing comments
- dealing with pragmas
- saving text of identifiers, numbers, strings
- saving source location for error messages
- whitespace, including newlines, is usually discarded, so line or column numbers cannot be recovered after the scanning phase (which is why the scanner records source locations).
2.3.1 Implementation
- Ad hoc
- mechanical DFA
- table-driven DFA
- In Java
- Token type
class Token {
    Kind kind;
    int pos;      // position in source
    int lineNum;  // useful for debugging
}
- Use each state as case in switch (Eg: START, AFTERSTART)
- Enums:
- they have types
- they are ordered
- can be used in switch case
- Handling comments
- track pos and line number
- Reserved words v/s keyword
- reserved word: cannot be used as identifier
- keyword: cannot be used as identifier because they have special meaning in the language
- Eg: goto in Java cannot be used as an identifier but is not a keyword (it does nothing); it is a reserved word
- Handling reserved word
- build identifier first then keyword
- list all the reserved words, then check if the identifier is present in the list
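The "build the identifier first, then check the reserved-word list" approach can be sketched as follows (the RESERVED set and method names are illustrative, not from the notes):

```java
import java.util.Set;

// Scans a maximal identifier starting at pos, then classifies it
// by checking membership in the reserved-word list.
public class IdentScanner {
    static final Set<String> RESERVED = Set.of("if", "else", "while", "return");

    public static String classify(String src, int pos) {
        int end = pos;
        // first, consume the maximal run of identifier characters
        while (end < src.length() && Character.isLetterOrDigit(src.charAt(end))) {
            end++;
        }
        String text = src.substring(pos, end);
        // only afterwards decide: reserved word or ordinary identifier
        return RESERVED.contains(text) ? "RESERVED(" + text + ")" : "IDENT(" + text + ")";
    }
}
```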
2.3.2 Whitespace
- skip
- used as separators
- add another path on start which loops to start on whitespace.
2.3.3 Automatic DFA generation
- DFA can be generated based on RE for parsing.
- Regular Expression → ε-NFA → NFA → DFA
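A table-driven DFA produced from a RE such as [0-9]+ might look like this minimal hand-written sketch (state numbering and table layout are illustrative):

```java
// Table-driven DFA for the regular expression [0-9]+ .
public class IntDFA {
    // rows = states (0 start, 1 accepting), columns = input class (0: digit, 1: other)
    static final int[][] TRANS = {
        { 1, -1 },   // state 0: digit -> 1, other -> dead (-1)
        { 1, -1 },   // state 1: digit -> 1, other -> dead
    };

    public static boolean accepts(String s) {
        int state = 0;
        for (char c : s.toCharArray()) {
            int cls = Character.isDigit(c) ? 0 : 1;
            state = TRANS[state][cls];
            if (state == -1) return false;   // dead state: reject immediately
        }
        return state == 1;                   // accept only if we saw at least one digit
    }
}
```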
2.4 Informal Summary
- Lexical analysis forms stream of tokens from stream of characters
- Takes care of:
- comments
- line and column number for error reporting
- text/strings
3 Parsing
3.1 Regular and Context Free language:
- Any set of strings that can be defined in terms of the first three rules (Atom, Alternative, Kleene closure) is called a regular language
- Any set of strings that can be defined if we add recursion is called context free language (CFL).
3.2 CFG: Context Free Grammar
- RE is not recursive, need more powerful formalism
3.2.1 BNF
- Σ set of alphabet (terminals)
- N set of non terminals (variables)
- P set of productions (substitution rules)
- S start symbol
3.2.2 EBNF
In addition to BNF:
- | or
- multiple production from same variable
- * 0 or more
- + one or more
3.2.3 Example:
expr → factor | expr op factor
factor → ident | numlit | (expr)
3.3 Non Context Free Grammar
- for CFG production doesn't depend on neighbors
- if non CFG, there would be more than one symbol on lhs.
- CFG has only one symbol on the left hand side, so it doesn't depend on neighbors, hence the context-freeness.
- programming languages are not context free
- we handle it separately using other techniques.
3.4 Parse tree
- Scanner converts stream of characters into stream of tokens
- Parser constructs parse tree from stream of tokens.
3.5 Ambiguous Grammar
- more than one parse tree for given expression
- Eg: the dangling else (if … if … else) ambiguity
3.6 Grammar Class
- we can create parser for any CFG which runs in O(n³)
- LL and LR run in linear time
- both are used in compilers, LR being more common
- LL v/s LR:
- LL tries to predict the next symbol by peeking the upcoming token and deciding which production to choose
- LR shifts tokens until it recognizes a right hand side, it then reduces it to the left hand side
3.6.1 LL: Left to right, Left-most derivation
- considered simpler than LR, smaller class than LR
- usually created by hand (not generated)
- Also called top-down or predictive parser
- usually hand written- we will use this in class
3.6.2 LR: Left to right, Right-most derivation
- usually generated
- larger class than LL
- also called bottom-up parser or shift-reduce parser
- skim the topic from book, not covered in class
- Subtypes of LR:
- SLR
- LALR
- full LR
3.7 Grammar and parsing
3.8 Top Down parser
3.8.1 LL(K): look ahead k tokens
3.8.2 LL(*): can look ahead as many tokens as needed
3.8.3 First Set
First of terminal is the set containing the terminal
S → AB
A → x | y
B → z | w
First(A) = {x, y}
First(B) = {z, w}
First(S) = {x, y}
In general: S → A | B gives First(S) = First(A) ∪ First(B)
- for LL(1), the First sets of alternatives for the same nonterminal should have no common element
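The First-set rules above can be sketched in Java for the toy grammar (names and encoding are illustrative; ε-productions are not handled here):

```java
import java.util.*;

// First-set computation for  S → AB,  A → x | y,  B → z | w.
// Uppercase chars are non terminals, lowercase chars are terminals.
public class FirstSets {
    static final Map<Character, List<String>> PRODS = Map.of(
        'S', List.of("AB"),
        'A', List.of("x", "y"),
        'B', List.of("z", "w"));

    public static Set<Character> first(char sym) {
        if (Character.isLowerCase(sym)) {
            return Set.of(sym);          // First of a terminal is the terminal itself
        }
        Set<Character> result = new HashSet<>();
        for (String rhs : PRODS.get(sym)) {
            // no ε-productions in this grammar, so only the leftmost symbol matters
            result.addAll(first(rhs.charAt(0)));
        }
        return result;
    }
}
```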
3.8.4 Follow Set
S → A B
A → x | y
B → z | w
Follow(A) = {z, w} = First(B)
Follow(B) = {}
Follow(x) = Follow(A)
- In A → x A, the trailing A is followed by whatever follows A itself (look at what comes after the symbol in the RHS)
3.8.5 Predict Set
Predict set is defined for a production.
EPS(α) = (α →* ε)
Predict(A → α) = First(α) ∪ (if EPS(α) then Follow(A) else {})
- If α can derive ε, then the predict set consists of First(α) and Follow(A).
- Predict set tell us if grammar is LL(1)
- and which production to choose for next input.
3.8.6 LL(1) Rule
The predict set of all productions with the same left side should be disjoint.
This is a necessary and sufficient condition for a grammar to be LL(1)
3.8.7 Recursive descent parsing working
- determine which production to get based on predict set
- If the predict set of all productions with same left side are not disjoint, then there are multiple options for production which can produce next symbol (but not necessarily complete string).
3.8.8 Left recursive grammar
- LL parsing doesn't work on Left recursive grammar
- Left recursive removal
A → A σ | β
can be replaced with
A → β B
B → σ B | ε
How to: expand the original grammar to get a feel for the language (strings generated).
A → A σ | β gives: β, β σ, β σ σ, β σ σ σ, …
Note that the strings start with β and have several trailing σ, so rewrite the grammar to represent this without left recursion:
A → β B
B → σ B | ε
3.8.9 associativity and precedence
- Associativity
- operator group left to right or right to left.
- Eg: 10 - 4 - 5 means (10 - 4) - 5 rather than 10 - (4 - 5)
- Precedence
- some operators group more tightly than others
- Eg: 10 - 4 × 3 means 10 - 12 rather than 6 × 3
- Inbuilt associativity and precedence
we can make grammar in such a way that associativity and precedence is handled automatically.
expr → term | expr add_op term
term → factor | term mult_op factor
factor → id | number | (expr)
add_op → + | -
mult_op → × | ÷
- in the above example, note how we have introduced term to make sure mult_op is grouped more tightly than add_op
- since the operators are left associative, we have expr (which has the potential to produce more add_op) always to the left of add_op
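A recursive descent evaluator for a non-left-recursive version of this grammar shows how precedence and associativity fall out of the grammar shape (a sketch; class and method names are illustrative, and error handling is omitted):

```java
// Recursive descent evaluator for:
//   expr → term { add_op term }
//   term → factor { mult_op factor }
//   factor → number | ( expr )
public class ExprParser {
    private final String src;
    private int pos = 0;

    private ExprParser(String src) { this.src = src.replace(" ", ""); }

    public static int eval(String s) { return new ExprParser(s).expr(); }

    private int expr() {
        int v = term();
        // the loop keeps grouping left to right: built-in left associativity
        while (pos < src.length() && (src.charAt(pos) == '+' || src.charAt(pos) == '-')) {
            char op = src.charAt(pos++);
            v = (op == '+') ? v + term() : v - term();
        }
        return v;
    }

    private int term() {
        int v = factor();
        // term binds tighter than expr: built-in precedence for * and /
        while (pos < src.length() && (src.charAt(pos) == '*' || src.charAt(pos) == '/')) {
            char op = src.charAt(pos++);
            v = (op == '*') ? v * factor() : v / factor();
        }
        return v;
    }

    private int factor() {
        if (src.charAt(pos) == '(') {
            pos++;                 // consume '('
            int v = expr();
            pos++;                 // consume ')'
            return v;
        }
        int start = pos;
        while (pos < src.length() && Character.isDigit(src.charAt(pos))) pos++;
        return Integer.parseInt(src.substring(start, pos));
    }
}
```

`eval("10 - 4 - 5")` gives 1 (left associativity) and `eval("10 - 4 * 3")` gives -2 (precedence), matching the examples above.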
3.9 Parser Construction
- we have LL(1) grammar, construct recursive descent parser
3.9.1 Grammar Simplification
- useless symbols
- redundant stuff
3.9.2 Handling errors
3.10 Parsers
3.10.1 Backtracking Algorithms
- BFS
- worst case exponential time
- worst case exponential space
- can parse every grammar
- DFS
- worst case exponential time
- worst case linear space
- Cannot parse left recursive grammar
- Both BFS and DFS deploy backtracking, if production does not apply, backtrack and try next production.
3.10.2 Predictive Algorithms
- Based on remaining input, predict which production to use
- Faster than backtracking: linear time
- Not all grammar can be parsed using predictive algorithm
- How to predict?
- use look ahead
- increasing the number of lookahead tokens makes the parser more powerful but also more complicated
- LL(1)
- Left to right scan of token
- Leftmost derivation
- (1): 1 token of lookahead
- Parse Table
1. E → int
2. E → (E Op E)
3. Op → +
4. Op → *
      int   (    )    +    *
E      1    2
Op                    3    4

- Example:

Stack           Input                    Rule
E $             (int + (int + int)) $    2
(E Op E) $      (int + (int + int)) $
E Op E) $       int + (int + int)) $     1
int Op E) $     int + (int + int)) $
Op E) $         + (int + int)) $         3
+ E) $          + (int + int)) $
E) $            (int + int)) $           2
(E Op E)) $     (int + int)) $
E Op E)) $      int + int)) $            1
int Op E)) $    int + int)) $
Op E)) $        + int)) $                3
+ E)) $         + int)) $
E)) $           int)) $                  1
int)) $         int)) $
$               $
- Construction of Parse table
- Intuition
The next character should uniquely identify a production, so we should pick a production that ultimately starts with that character
- First Sets
The First(A) is the set of terminals that can be at the start of a string produced by A.
- we can use First set to make parsing table.
- if rules contain ε productions, then we have to look through the non terminals and see the First of the next non terminal
- For A → α, where α's First set contains ε: First(A) = First(A) ∪ {ε}
- For A → α t ω, where α's First set contains ε and t is a terminal: First(A) = First(A) ∪ {t}
- For A → α B ω, where B is a non terminal and α's First set contains ε: First(A) = First(A) ∪ First(B)
- Follow Set
Follow(A) is set of all terminal symbols that can immediately follow a
subsequence derived from A in a sequence derived from start symbol, S.
Num → Sign Digits
Sign → + | - | ε
Sign has an ε production, so First(Num) will contain First(Digits)
- The follow set represents the set of terminals that might come after a given
non terminal.
- If B → α A ω then: Follow(A) = Follow(A) ∪ First*(ω)
- If B → α A ω and ε ∈ First*(ω) then: Follow(A) = Follow(A) ∪ Follow(B)
- In other words, the follow of a symbol is whatever can come after it, so that if the symbol derives ε the follow can appear. If the next symbol derives ε, then add the follow of the parent.
- Final Construction
- For each rule A → ω, for each terminal t ∈ First*(ω), set T[A, t] = ω (at A, if you see t, use the production A → ω)
- For each rule A → ω, if ε ∈ First*(ω): set T[A, t] = ω for each t ∈ Follow(A) (at A, if A can derive ε and t can follow A, take the ε path and let A disappear)
- Limitation of LL(1)
- Left recursion
can be eliminated
A → A ω | α
rewrite as
A → α B
B → ω B | ε
Explanation: A can produce α, α ω, α ω ω, …; every string starts with α. The second rule repeats ω, and ε stops the repetition.
- Left factoring

E → T
E → T + E
T → int
T → (E)

rewrite as:

E → TY
Y → + E | ε
...
- Condition for LL(1):
A grammar G is LL(1) iff for any productions A → ω1 and A → ω2 the sets:
First(ω1 Follow(A)) and
First(ω2 Follow(A))
are disjoint
This condition is equivalent to saying that there are no conflicts in the table
4 Binding Times
- Language design time: control flow constructs, primitive data types…
- Language implementation time: IO, size of stack, heap etc
- program writing time: variable name
- compile time: layout of static data
- Link Time: compiler compiles different modules at different times, binding to variables in other module is finalized during link time.
- Load time
- run time: address of variable.
4.1 Binding and visibility
void testFoo() {
    int arg = 100;
    for (int arg = 0; arg < 10; arg++) {
        // arg here loops from 0 to 9
        // the binding arg = 100 is active here but not visible
    }
    // here arg = 100
}
4.2 Storage Allocation Mechanism
4.2.1 Static Allocation
- Global variables
- literals
- tables generated for helping runtime: debugging, dynamic type checking etc
- Languages without recursion
- some languages such as Fortran (before Fortran90), did not have recursion.
- as a result, only one invocation of subroutine will be active at a time.
- space for the variables can be allocated statically and avoid runtime overhead of creation and destruction.
- Constants
- constants can be allocated space during compile time, since their value can be shared across recursive calls.
- Some constants value cannot be determined until runtime. Such elaboration-time constants are allocated on stack.
4.2.2 Stack Allocation
- if language allows recursion, local variables cannot be allocated statically.
- each instance of subroutine has its own frame or activation record
- each frame contains:
- arguments
- return values
- local variables
- temporaries: space for implicit temporary variables. Eg: in (2 + 2 + 3), store 4 in a temporary, then compute 4 + 3
- Bookkeeping information:
- subroutine's return address
- reference to stack frame of caller (dynamic link)
- offset of objects from frame base is known at compile time.
- Stack grows downwards (towards lower addresses) in most language implementations.
- Local variables, temporaries and bookkeeping information typically have negative offsets from the frame pointer
- Arguments and return values typically have positive offsets; they reside in the caller's frame
- even if a language doesn't support recursion, stack-based allocation of local variables is better than static allocation: since not all subroutines are active at the same time, stack-based allocation takes less memory.
4.2.3 Heap Allocation
- Issues:
- Internal fragmentation
- External fragmentation
- Allocation strategy:
- First fit: less search time, less space-efficient
- Best fit: more search time, more space-efficient
- List of preallocated standard size blocks: fibonacci heap and buddy system
- compacting requires moving already allocated blocks – requires updating existing pointer.
- Garbage Collection
- Implicit
- unreachable object
- on exiting scope - C++ RAII
- Implicit
4.3 Scope
- the textual region of the program in which the binding is active is its scope
- Bindings for local objects are created and binding for global objects are made invisible on entering a new scope.
- On scope exit, we delete local variables binding and reactivate the binding for global variables.
- This does not require any code to be executed → can be determined statically → statically scoped or lexically scoped
- Set of active binding is called current referencing environment
4.3.1 Static Scope
- there can be nested scope.
- search the current scope, if not found then search the outer scope. Repeat until reaching the global scope.
- if no binding is found even in the global scope, then report an error.
- Hole in scope
- A name-to-object binding that is hidden by a nested declaration of the same
name is said to have a hole in its scope.
- or we can say the closer object has eclipsed the outer definition.
- Some languages may have qualifier to access outer scope variables.
- Objects in lexically surrounding scopes
- the frames of surrounding subroutines are already active on the stack.
- Static Chain
link to the surrounding block's stack
(blockA) {
    (blockB) {
        (blockC) { }
        (blockD) { }
    }
}

in the above example, during execution of blockD, the scope for blockB has to be active. If A calls B, B calls D, and D calls C, the stack (top to bottom) is: blockC, blockD, blockB, blockA.
- C and D have static links to B
- B has static link to A
4.3.2 Declaration Order
- can we use a variable in a block before it was declared.
- Forward Reference
An attempt to use a name before its declaration
- Mixed strategy used by languages
- In Java/C++, members are visible throughout the class, even before they are defined.
- Python, Scheme and Modula-2 do it differently. Refer to book page 133.
- Mixed strategy used by languages
- Declaration and Definition
- to implement mutually recursive data structures (example: manager and employee), C and C++ have declarations and definitions.
- Declaration gives the compiler enough information about a name and its scope.
- Definition gives actual implementation details.
- Redeclaration
- most dynamic languages overwrite the older value.
  In ML, however, the old meaning of the variable is accessible to functions defined before the redeclaration.
let x = 1;
let f y = x + y;
f 2  (* gives 3 *)
let x = 3;
f 2  (* we might expect 5, but the output is 3 (1 + 2) *)
- Languages like Scheme store bindings of names in known locations, whereas an ML program is compiled to access the old meaning of the variable.
4.3.3 Dynamic Scoping
- The bindings depends on flow of control during runtime.
4.3.4 Implementing Scope
- Symbol table is dictionary which maps name to information about it.
- Static Scope:
- never delete anything from symbol table.
- Most variations on static scoping can be handled by augmenting a basic dictionary-style symbol table with enter_scope and leave_scope operations to keep track of visibility.
- LeBlanc and Cook Table
- Each scope, as it is encountered, is assigned a serial number.
- Eg: The outermost scope (the one containing predefined identifiers) is given number 0.
- The scope containing programmer defined global variables is given 1, and so on.
- All names are entered into single large hash table
- Each entry contains symbol name, category (variable, constant, type, procedure, fieldname, parameter), scope number and other details.
- Symbol table has scope stack that indicates the scopes that compose the current referencing environment.
- To look up a name in the table, we scan its hash-table entries; for each entry, we search the scope stack to check whether that entry's scope is visible.
- Dynamic Scope:
- A-list (Association list)
- a simple list of name-value pairs.
- When used to implement dynamic scoping, it functions as a stack: new declarations are pushed as they are encountered and popped at the end of the scope in which they appear.
- Central Reference table
- maintains a mapping from names to current meanings.
- new entries are added in front of the chain (bucket list (when key collides))
- lookup returns the entry at front of the chain.
- on exit, delete the entry at front.
4.3.5 Shallow and Deep binding (passing subroutine as parameter)
def sub1:
    def sub2:
        print x    # prints 3 (deep), 4 (shallow)

def sub3:
    x = 3
    sub4(sub2)     # <- deep binding created here

def sub4(subx):
    x = 4
    subx()         # <- shallow binding created here

x = 1
sub3()
- Deep Binding
  - referencing environment is the one in effect when the reference is created
  - in the example, the binding x = 3 is active when we pass sub2
- Shallow Binding
  - referencing environment is the one in effect when the routine is called
  - in the example, the binding x = 4 is active when we call subx
- Deep Binding and Closure
- Note that Deep binding has reference to the binding environment.
- if we pass this subroutine (second class citizen), the subroutine can access the environment deep in the stack.
- but this still can be implemented using stack and pointers. No need to allocate space on heap.
- the issue arises when we return the subroutine and the stack unwinds. In this case, to have a closure we need unlimited extent and thus heap allocation for the subroutine's environment. See 4.3.6.4
4.3.6 First Class and unlimited extent
- First Class
- can be passed as parameter
- can be returned from subroutine
- can be assigned to a variable
- Second Class
- can be passed as parameter
- can not be returned from subroutine
- can not be assigned to variable
- Third Class
- cannot be passed as parameter
- cannot be returned from subroutine
- cannot be assigned to variable
- Misc: Unlimited Extent
- check 4.3.5.3, unlimited extent needed only in case of first class function, second class can do with stack allocation.
- if we return subroutine and that subroutine has access to local variable
in surrounding subroutine,
- we cannot use stack based allocation for local variables
- the allocation must be on heap
- handled by garbage collection
- C does not have nested subroutines
- can have pure stack based allocation for local variables
- Scheme will have heap based allocation for local variables
- Imperative languages avoid first-class functions to avoid heap-based allocation of local variables.
- Function Object/Object Closure/Functor
- to implement closure in imperative language such as Java, we can store the context in object's field.
example:
interface IntFunc {
    public int call(int i);
}
class PlusX implements IntFunc {
    final int x;
    PlusX(int n) { x = n; }
    public int call(int i) { return i + x; }
}

Here we are saving the closure context in the object field int x.
4.4 Alias
Multiple name bound to same object.
- Overloading: correct function is called according to the type and number of
parameters passed.
- can be determined at compile time.
- Polymorphism: correct function (method) is invoked according to type of
instance (parent or child).
- Determined at runtime.
- Coercion: automatic conversion of type by the compiler
  - Eg: +
    - if + is defined only for int and a float is supplied, coercion will take place
    - if a separate + is defined for float, by overloading the float + will be called
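The contrast between overloading and coercion can be sketched in Java, using a hypothetical plus method in place of the + operator:

```java
// Overloading vs coercion: two overloads of a hypothetical `plus`.
public class OverloadDemo {
    static String plus(int a, int b)     { return "int version"; }
    static String plus(float a, float b) { return "float version"; }

    public static void main(String[] args) {
        // exact match: overload resolution picks the int version at compile time
        System.out.println(plus(1, 2));
        // no (int, float) overload exists: the int argument 1 is coerced
        // (widened) to float and the float version is called
        System.out.println(plus(1, 2.5f));
    }
}
```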
5 Semantic Analysis
- Syntax of language can be defined using context-free grammar
- Semantics cannot be defined using CFG.
- some semantics can be defined, but it is very complicated to do so.
- We describe semantics using Attribute grammar (AG) which is like pseudocode.
- Semantics analysis can be divided into:
- Static: check for semantic error at compile time
- Dynamic: generate code which will check for error during execution time.
5.1 Static Analysis
- Compile time algorithm which predict runtime behavior.
- Example: type checking
5.2 Semantics:
- Operational Semantics
- Denotational Semantics
- Axiomatic Semantics
- Attribute Grammar
5.3 Attribute Grammar
- formal way to define attributes for production of formal grammar, associating these attributes to values.
- Based on context free grammar
- Symbols in grammar are associated with attributes
- Production rules in grammar are associated with semantic rules indicating how the attribute values are computed.
- May include conditions that impose constraint on legal sentences.
5.3.1 Synthesized
- values are calculated (synthesized) only in productions in which their symbol appears on left hand side.
- An attribute grammar in which all attributes are synthesized is said to be S-attributed.
- Value of parent is calculated using children, left is calculated using right.
5.3.2 Inherited
- values are calculated when their symbol is on the right hand side of the current production.
- value is calculated using the parent and left siblings (hence the statement above that the symbol is on the RHS).
5.3.3 Attribute Flow
- attribute grammar is well defined if its rules determine a unique set of values for the attributes of every possible parse tree.
- non circular: if no attribute ever depends on itself.
- An algorithm that decorates parse tree by invoking the rules of an attribute
grammar in an order consistent with the tree's attribute flow is called a
translation scheme.
- most obvious: make repeated passes over a tree and invoke semantic function whose arguments have all been defined and stopping when no such value changes.
- dynamic: similar to topological sort
- static: based on the grammar
- S-attributed grammars form the simplest of such class.
- attribute flow in an S-attributed grammar is strictly bottom-up
- can be combined with bottom up parsing.
- can be implemented on the fly during LR parse.
- L-attributed grammars can be evaluated by visiting the nodes of the parse
tree in a single left to right, depth first traversal.
- can be implemented on the fly during LL parse.
- in the same order they are visited during a top-down parse.
- L-attributed grammars can be defined more formally as:
  - Each synthesized attribute of a left hand side symbol depends only on that symbol's own inherited attributes or on attributes (synthesized or inherited) of the production's right hand side symbols.
    - either from the parent or synthesized through children
  - Each inherited attribute of a right hand side symbol depends only on inherited attributes of the left hand side symbol or on attributes (synthesized or inherited) of symbols to its left in the right hand side.
    - an inherited attribute (not synthesized; top down) comes either from the parent or from a left sibling.
- Every S-attributed grammar is also an L-attributed grammar
  - but the reverse is not true: S-attributed grammars do not permit inherited attributes on right hand side symbols, so there are L-attributed grammars which are not S-attributed.
5.4 Are generics just sugar coating over instanceof?
- no, the compiler does static type checking
6 Control Flow
6.1 Precedence and associativity
- The basic arithmetic operators always associate left to right: 9 - 3 - 2 is 4, not 8.
6.2 Assignment
- side effect: if a construct influences subsequent computation in any way other than by returning a value for use in the surrounding context.
- Expressions: always produce a value and may or may not have side effect
- statement: executed solely for their side effects and return no useful value.
6.3 References and Values
6.3.1 value model
- l-value: used on left hand side of assignments, expressions that denote locations are referred as l-values
- r-value: expression that denote value, used on right hand side
6.3.2 reference model
- no distinction is needed between l-values and r-values: every variable is a reference
- a variable is not a named container for value, it is a named reference to a value
Example:
b := 2; c := b; a := b + c;
Value Model                               Reference Model
we put 2 in b                             we let b refer to 2
copy it into c                            let c refer to it also
read these values, add them together,     pass the references to the + operator
and then place the result in a            and then let a refer to the result

- in the reference model there is only one 2, and all variables having value 2 refer to it.
- in reference model every variable is l-value
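Java exhibits both models side by side: primitives follow the value model, while objects (here, an array) follow the reference model. A sketch with illustrative names:

```java
// Value model vs reference model in Java.
public class ModelDemo {
    public static int referenceModel() {
        int[] b = {2};
        int[] c = b;     // c refers to the same array object as b
        b[0] = 7;
        return c[0];     // 7: the update through b is visible through c
    }

    public static int valueModel() {
        int x = 2;
        int y = x;       // y gets its own copy of the value
        x = 7;
        return y;        // still 2: the copy is unaffected
    }
}
```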
6.3.3 Boxing
- Java collections expect Object; primitive data types are not descendants of Object, so we need to box them before using them in collections.
- Recent versions of Java do auto-boxing, but a raw collection returns Object, which must be unboxed manually.
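With generics, boxing and unboxing both happen automatically, as in this minimal sketch:

```java
import java.util.ArrayList;
import java.util.List;

// Auto-boxing: the collection stores Integer objects,
// but we read and write plain ints.
public class BoxingDemo {
    public static int sum() {
        List<Integer> xs = new ArrayList<>();
        xs.add(1);           // int 1 is auto-boxed to Integer
        xs.add(2);
        int total = 0;
        for (int x : xs) {   // each Integer is auto-unboxed back to int
            total += x;
        }
        return total;
    }
}
```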
6.3.4 Initialization
- In C, statically allocated variables are initialized to 0.
- Java and C# initialize class-typed fields of objects to null.
6.3.5 Ordering within expressions
a - f(b) - c * d;
f(a, g(b), h(c));
- we do not know which will evaluate first
- may affect if the function calls have side effects
- may affect compiler optimization of register allocation.
- since compiler can optimize register allocation, the order of evaluation is
left to implementation in some languages.
- except for Java and C#, where it is left to right.
6.3.6 Mathematical Identities
c = b - a + d // rearrange as c = d - a + b
- can cause overflow.
6.3.7 Switch Case
- jump table:
- can consume large space if set of labels is non dense and includes large value ranges.
- sequential testing
- if number of case statement is less
- hashing
- if set of label is large and has missing values and no large ranges
- binary search
6.3.8 Applicative and Normal Order Evaluation
- Applicative Order: evaluating before call
- Normal Order: evaluating when (if) needed
6.3.9 Non determinacy
- in which choice between alternatives (ie between control paths) is deliberately unspecified.
7 Type Systems
- bits are untyped. hardware makes no attempt to keep track of which interpretation
correspond to which location.
- bits can be interpreted as:
- instructions
- address
- float, int, char literal
- assembly is untyped
- higher level languages keep track of types to provide contextual information and error reporting.
- bits can be interpreted as:
- subroutines have types in languages in which they are 1st or 2nd class values.
- strongly typed: if language prohibits application of any operation to any object that is not intended to support that operation.
- statically typed: if language is strongly typed and type checking is done at compilation stage.
- Dynamic type checking: checking at runtime.
7.1 Meaning of Type
- from denotational point of view, type is simply a set of values.
- from structural point of view, type is either one of a small collection of built in types (integer, character, Boolean etc) or composite type created.
7.2 Polymorphism
- "having multiple form"
- Parametric Polymorphism: the code takes in type either explicitly or implicitly.
- Explicit Parametric Polymorphism: generics/templates
- Subtype Polymorphism: the code is designed to work with values of some
type T, but programmer can define additional types to be extension of T and
code will work with these types too.
- Usually found in OO Languages.
7.3 Orthogonality
- Trivial Type: To characterize a statement that is executed for its side effect and has no useful value, provide a trivial type: void in C, unit in ML.
- In Pascal, we would have needed to capture the return value in a dummy variable.
7.4 Enumeration
- In C, enums are just typedef-ed integers
- In Pascal, enums and int are of different types; mixing them is not allowed.
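Java's enums behave like Pascal's rather than C's: a distinct, ordered type that is not an integer. A sketch (the Color type and method names are illustrative):

```java
// Java enums: a distinct type, not an int, but still ordered and switchable.
public class EnumDemo {
    enum Color { RED, GREEN, BLUE }

    public static int ordinalOf(Color c) {
        // int i = c;       // compile error: no implicit enum-to-int mixing
        return c.ordinal(); // ordering is still defined
    }

    public static String describe(Color c) {
        switch (c) {        // enums can be used in a switch
            case GREEN: return "green";
            default:    return "other";
        }
    }
}
```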
7.5 Sub-range type
- Why use subrange instead of int:
- helps in documentation
- dynamic check by compiler to see if value is in range
- optimize storage based on range.
7.6 Composite Types
7.6.1 Records (structs)
- a record consists of collection of fields, each of which belong to a simpler type
- similar to mathematical tuple, which corresponds to Cartesian product of types of the fields.
7.6.2 Variant Records (Union)
- similar to Records, but only one field is valid at any given time.
- it is the disjoint union of field types.
7.6.3 Arrays
- function that maps index to member of component type.
7.6.4 Sets
- Collection of distinct elements of base type.
7.6.5 Pointers
- l-values
- Pointers are often, but not always, implemented as addresses.
- often used to define recursive data type
7.6.6 Lists
- a list is defined recursively as either an empty list or a pair consisting of a head element and a reference to a sublist.
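The recursive definition can be transcribed almost literally. A minimal Python sketch (encoding and names are mine), representing a list as None or a (head, rest) pair:

```python
# A list is either the empty list or a pair (head, reference-to-sublist).
empty = None

def cons(head, rest):
    return (head, rest)

def length(lst):
    # recurse on the sublist, mirroring the recursive definition
    return 0 if lst is empty else 1 + length(lst[1])

xs = cons(1, cons(2, cons(3, empty)))
assert length(xs) == 3
```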
7.6.7 Files
- similar to arrays, maps index to component type.
7.7 Type Checking
Set of rules for:
- Type Equivalence: when are the types of two values the same
- Type Compatibility: when can a value of a given type be used in a given context
- Type Inference: how to determine the type of an expression
7.8 Type Equivalence
7.8.1 Structural Equivalence
- based on content of type definition
- two types are the same if they consist of the same components in the same order.
- to determine if two types are structurally same, a compiler can expand their definition by replacing any embedded type names with their respective definitions recursively.
7.8.2 Name Equivalence
- based on lexical occurrence of type definitions
- each definition introduces new type
- based on the assumption that if the programmer writes two type definitions, then those are probably meant to represent different types.
- strict name equivalence: aliased types are considered distinct
- loose name equivalence: aliased types are considered equivalent
Ada gives both options:

subtype elemType is integer;  -- loose type equivalence
type person is new string;    -- strict type equivalence; new makes the difference
7.9 Type Conversion and Cast
depending on the types involved, conversion may or may not require code to be executed during runtime:
- Types structurally equivalent but the language uses name equivalence: same low-level representation, no code needed.
- Types have different sets of values but intersecting values are represented in the same way (Ex: one may be a subrange of the other). If the value assigned is outside the expected set, a runtime check must be executed.
- Types have different low level representation but can define correspondence
in their values, (Ex: int and float). Runtime code is needed for checking
and conversion.
- check might be needed to see if the value is within the bound
7.9.1 Nonconverting Type Cast
- changing the type without changing the underlying implementation
- i.e. interpret the bytes of a value as if they were another type
- called a nonconverting type cast, or type pun
- reinterpret_cast in C++
- In C, this can be done using r = *((float *) &m); take the address of m (an int), convert it into the address of a float, and then dereference it.
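The same bit-level reinterpretation can be sketched in Python with the struct module (variable names are mine): pack the int's bytes, then unpack the identical bytes as an IEEE-754 float.

```python
import struct

m = 0x3F800000                                   # bit pattern of 1.0f
# reinterpret the 4 bytes of the int as a float: a type pun, no value conversion
r = struct.unpack("<f", struct.pack("<I", m))[0]
assert r == 1.0
```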
7.9.2 Coercion
- implicit conversion to expected type by language.
7.10 Universal Reference type
- void * in C/C++
- Object in Java
7.11 Type inference
- Ex: the sum of two subrange values may be out of the subrange's bounds, so the inferred type of the sum is int
7.12 Parametric Polymorphism
7.12.1 Generic Implementation
- C++ template implementation is purely static.
- compiler creates a separate copy of the code for every instance.
- Java generics use the same code for all types at run time
- the compiler inserts the casts
7.13 Equality Testing and assignment
- Deep Copy: copy the object and all objects pointed to by it, recursively
- Shallow Copy: copy the object and its references as they are.
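Python's copy module shows the difference directly; a quick sketch:

```python
import copy

a = [[1, 2], [3, 4]]
shallow = copy.copy(a)       # copies the outer list, shares the inner lists
deep = copy.deepcopy(a)      # recursively copies everything reachable

a[0].append(99)
assert shallow[0] == [1, 2, 99]   # the shallow copy sees the mutation
assert deep[0] == [1, 2]          # the deep copy does not
```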
8 Composite Types
8.1 Records (Structures)
- heterogeneous types
- alignment restriction may lead to holes in memory layout
- can use pragma pack to give a hint to the compiler to optimize for space rather than speed
  - packing may lead to a multi-instruction sequence to retrieve the pieces of a field from memory and then reassemble them.
- languages support assigning one record to another but not testing them for equality
  - can use block_copy to assign
  - but block_compare may return an incorrect result due to garbage in the holes
- easy workaround: sort fields in ascending order of size
  - may be done by the compiler, transparently to the programmer
- reorganization may be helpful if we can group frequently accessed fields together so that they lie in the same cache line.
- C/C++ lay out fields in the order defined by the programmer
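Alignment holes and packing can be observed from Python via ctypes. A sketch assuming a typical platform where c_int is 4 bytes with 4-byte alignment (the struct names are mine; _pack_ plays the role of pragma pack):

```python
import ctypes

class Padded(ctypes.Structure):
    # default alignment: 3-byte hole after c so i starts on a 4-byte boundary
    _fields_ = [("c", ctypes.c_char), ("i", ctypes.c_int)]

class Packed(ctypes.Structure):
    _pack_ = 1                      # analogous to pragma pack: no holes
    _fields_ = [("c", ctypes.c_char), ("i", ctypes.c_int)]

assert ctypes.sizeof(Packed) == 5   # 1 + 4, no padding
assert ctypes.sizeof(Padded) == 8   # 1 + 3 (hole) + 4
```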
8.2 Unions
- the overall size of a union is that of its largest member
- only one of the members is active at a time
- unions allow the same set of bytes to be interpreted in different ways at different times
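A ctypes.Union makes the "same bytes, different interpretations" point concrete; this sketch (names are mine) writes the float member and reads the same four bytes back as an integer:

```python
import ctypes

class IntOrFloat(ctypes.Union):
    _fields_ = [("i", ctypes.c_uint32), ("f", ctypes.c_float)]

u = IntOrFloat()
u.f = 1.0
# the same 4 bytes, reinterpreted: IEEE-754 bit pattern of 1.0f
assert u.i == 0x3F800000
assert ctypes.sizeof(IntOrFloat) == 4   # size of the largest member
```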
8.3 Arrays
- homogeneous
- mapping index to component type
- some languages allow non-discrete index types
  - can be implemented using hash tables or search trees
8.3.1 Storage allocation
- Stack Allocation
- Dope Vector
  - data structure that holds information about a data object's memory layout
  - has information about the lower bound and size of each dimension
  - the dope vector for a record indicates the offset of each field from the beginning of the record.
- C supports dynamic shape for both parameters and local variables
- to ensure a local object can be found using a known offset from the frame pointer, the stack frame is divided into two parts:
  - Fixed-size part: variables whose size is known statically go here
  - Variable-size part: objects whose size is not known until elaboration time go here; a pointer to the object and its dope vector go in the fixed-size part
- Heap Allocation
- if the number of dimensions of a dynamic array is known statically, the dope vector can be kept, along with a pointer to the data, on the stack. If the number of dimensions is not known, the dope vector is kept at the beginning of the heap block instead.
- Memory layout for multi-dimension array
- Column Major: columns kept together
- Row Major: rows kept together
- affects cache behavior: if elements are accessed in an order that results in cache misses, performance suffers.
- Fortran uses column-major order
- 2d array may be implemented as array of pointers.
- more space required
- allows rows of different sizes
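The two layouts differ only in the address computation. A sketch of both offset formulas (function and parameter names are mine):

```python
def row_major_offset(i, j, ncols, elem_size):
    # elements of a row are contiguous
    return (i * ncols + j) * elem_size

def col_major_offset(i, j, nrows, elem_size):
    # elements of a column are contiguous
    return (j * nrows + i) * elem_size

# element [1][2] of a 3x4 array of 4-byte elements
assert row_major_offset(1, 2, ncols=4, elem_size=4) == 24
assert col_major_offset(1, 2, nrows=3, elem_size=4) == 28
```

Walking a row in a row-major array touches consecutive addresses (cache-friendly); walking it in a column-major array strides by nrows * elem_size.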
- Summary

| Lifetime  | Shape                                           | Allocation region                     |
| global    | static                                          | static global memory                  |
| local     | static                                          | stack                                 |
| local     | bound at elaboration time                       | stack, with indirection (dope vector) |
| arbitrary | bound at elaboration time (size doesn't change) | heap                                  |
| arbitrary | arbitrary                                       | heap (with dope vector)               |

- In the third case, space is allocated at the end of the stack frame (after temporaries); the dope vector contains a pointer to the base address.
8.4 Strings
- Dynamic-length strings are easier to implement than other dynamic structures since they contain no references to other objects; copying is easy.
8.5 Sets
- Unordered, distinct elements
- can be implemented:
- using bit-vector: 1 if present 0 otherwise
- ordered list (binary search)
- unordered list (hash map)
8.6 Pointers and recursive types
- pointers are needed only in languages that use the value model
- Pointer is a high level concept: a reference to an object
- address is a low level concept
- Pointers are often implemented as address but not always
8.6.1 Value Model
A = B puts the value of B into A
8.6.2 Reference Model
A = B makes A refer to the same object that B refers to
8.6.3 Dangling Pointers
- Tombstone: an extra level of indirection, marked in such a way that it reflects whether the object is still alive
- Locks and Keys: attach a word to each pointer (the key) and to each object (the lock); the words must match for the pointer to be valid.
8.6.4 Garbage Collection
- Reference counting
- Smart Pointers (C++)
  - unique_ptr: only one owner; on assignment (move), ownership is transferred
  - shared_ptr: has a tombstone-like intermediate object (control block) which keeps the reference count, destructor, etc.
  - weak_ptr: similar to shared_ptr but does not increase the reference count
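The shared_ptr-style control block can be sketched as a toy class (entirely my own names; real smart pointers hook into destructors rather than explicit release calls):

```python
class Ref:
    """Toy reference-counted handle, a sketch of shared_ptr-style counting."""
    def __init__(self, obj):
        self.obj = obj
        self.count = 1          # the control block's counter

    def share(self):
        self.count += 1         # another owner
        return self

    def release(self):
        self.count -= 1
        if self.count == 0:
            self.obj = None     # "destroy" the object when the last owner leaves

r = Ref("payload")
r2 = r.share()
assert r.count == 2
r.release()
assert r2.obj == "payload"      # still alive: one owner remains
r2.release()
assert r.obj is None            # count hit zero, object reclaimed
```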
8.7 Lists
8.8 Files
9 Subroutine and Control Abstraction
- Side Effects: A construct has side effect if it influences the subsequent computation in any way other than returning a value
- Referential transparency: The value of an expression depends only on its referencing environment and not when it is evaluated.
- Expressions always produce a value and may or may not have side effects
- Statements are executed solely for their side effects.
- Boxing: automatic conversion between value and reference models
10 TODO Scott Supplement
- Error Recovery
11 λ Calculus
11.1 Functions
- mapping from domain to range
- can be denoted as a set of pairs: sqrt = {(x, y) ∈ ℝ × ℝ | x = y²}
- λ calculus is calculus of anonymous functions
- a method of representing functions
- a set of conversion rules for syntactically transforming them.
- λ x. * 2 x
  - λ x. * 2 x is a lambda abstraction
  - x is the bound variable
  - * 2 x is the body
- λ x. λ y. * (+ x y) 2
In scheme:
(lambda (x) (lambda (y) (* (+ x y) 2)))
- a function of x which returns a function of y which returns * (+ x y) 2
- λ y. * (+ x y)
- y is bound in this expression
- x is free
- λ x. λ y. * (+ x y) 2
- both x and y are bound in this expression
11.2 Pure λ calculus
- Everything is a function
- Example: Natural numbers can be obtained from:
- constant function 0
- successor function S(n)
- Example: 3 = S(S(S(0)))
11.3 Conventions to avoid parentheses
- Application associates to the left
- (f g h) → (f g) h
- Application has higher precedence than abstraction
- λ x. A B → λ x. (A B)
- Abbreviate nested expressions with a list
- λ x y z. E → (λ x. (λ y. (λ z. E)))
11.4 Evaluation of λ expression (Reduction)
11.4.1 Reduction Rules
- δ (delta)
- used for built in constant
- (+ 2 3) →δ 5
- β (beta)
- replace all occurrences of bound variable in the body with supplied parameter.
- ((λ x. E1) E2)
  - replace all occurrences of x with E2 in the body E1
- Ex: (λ x. * 2 x) 5 →β (* 2 5)
- this can be further evaluated →δ 10
- α (alpha)
- rename bound variable
- (λ x. E(x)) →α (λ y. E(y))
- Ex: (λ x. x + 1) →α (λ y. y + 1)
- η (eta)
- remove unnecessary abstractions
- (λ x. f) →η f, provided x does not occur free in f
- (λ x. + 2 3) →η (+ 2 3) →δ 5
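The β rule can be run mechanically. A small normal-order reducer sketch in Python (term encoding and names are mine; it assumes bound names are distinct, so no capture-avoiding renaming):

```python
# Terms: ('var', name) | ('lam', name, body) | ('app', f, a)

def subst(term, name, value):
    """Replace free occurrences of `name` in `term` with `value`."""
    kind = term[0]
    if kind == 'var':
        return value if term[1] == name else term
    if kind == 'lam':
        if term[1] == name:                 # name is rebound here; stop
            return term
        return ('lam', term[1], subst(term[2], name, value))
    return ('app', subst(term[1], name, value), subst(term[2], name, value))

def beta(term):
    """One leftmost-outermost β-step, or None if the term is in normal form."""
    if term[0] == 'app':
        f, a = term[1], term[2]
        if f[0] == 'lam':
            return subst(f[2], f[1], a)     # contract the redex itself
        r = beta(f)
        if r is not None:
            return ('app', r, a)
        r = beta(a)
        if r is not None:
            return ('app', f, r)
    elif term[0] == 'lam':
        r = beta(term[2])
        if r is not None:
            return ('lam', term[1], r)
    return None

def normalize(term, limit=100):
    while limit:
        nxt = beta(term)
        if nxt is None:
            return term
        term, limit = nxt, limit - 1
    raise RuntimeError("no normal form within step limit")

# (λx. λy. x) a b  →β  (λy. a) b  →β  a
true = ('lam', 'x', ('lam', 'y', ('var', 'x')))
t = ('app', ('app', true, ('var', 'a')), ('var', 'b'))
assert normalize(t) == ('var', 'a')
```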
11.4.2 Normal Form
A λ expression is in normal form if it cannot be further reduced using β, δ or η reduction.
- Can every λ expression be reduced to normal form?
- (λ x. x x) (λ x. x x)
- No
- Since λ expression can model never terminating expression, it is as powerful as Turing machine.
- Is there more than one way to reduce a λ expression?
- Yes
- ((λ x. + (λ y. 4) x) 3)
- →β (+ (λ y. 4) 3) →η (+ 4 3) →δ 7
- →η ((λ x. + 4 x) 3) →β (+ 4 3) →δ 7
- If there is more than one way to reduce, does each lead to the same normal form expression?
- Yes
- Is there a reduction strategy that guarantees that a normal form will be produced (if there is one)?
- Yes
- Standardization Theorem
- If an expression has a normal form, then a leftmost-outermost reduction at each step is guaranteed to reach that normal form.
- δ or β does not always reduce: ((λ x. λ y. y) ((λ z. z z)(λ z. z z)))
  - taking the outermost redex →β (λ y. y)
  - taking the innermost redex →β ((λ x. λ y. y) ((λ z. z z)(λ z. z z))) →β …
11.5 Reduction Strategies
11.5.1 Normal Order
- always choose leftmost outermost redex
- normal order will yield the normal form if the expression has one
- corresponds to call by name
- parameter is passed as unevaluated expression that is evaluated in the body of the function being executed each time the parameter is referenced.
- "always choose leftmost outermost redex": prefer function application over parameter evaluation.
11.5.2 Applicative Order
- always choose leftmost innermost redex
- corresponds to call by value
- an argument is evaluated before the function is applied
- "always choose leftmost innermost redex": choose parameter evaluation over function application.
11.5.3 Comparison
- ∞ computation
(define eternity (lambda () (eternity)))
(define zero (lambda (x) 0))
(zero (eternity))
- normal order: gives 0 immediately
- applicative order: in this case, may not return
- Argument Evaluation
fun sqr x = x * x;
sqr(sqr(sqr(2)))
(* normal order evaluation:
   = sqr(sqr(2)) * sqr(sqr(2))
   = sqr(2) * sqr(2) * sqr(sqr(2))
   ... redundant computations *)
(* applicative order evaluation:
   = sqr(sqr(4)) = sqr(16) = 256 *)
- Graph Reduction (Lazy Evaluation)
- has pointer to expression
- save the result of expression after evaluation
- can pass argument unevaluated – normal order
- can save against long/∞ computation
11.6 Booleans and condition functional representation
- true: define true = λ x. λ y. x
- false: define false = λ x. λ y. y
- if: define if = λ p. λ x. λ y. p x y
11.6.1 if true a b = a
if true a b
(λ p. λ x. λ y. p x y) true a b
→β (λ x. λ y. true x y) a b
→β (true a b)
replace true with its definition
(λ x. λ y. x) a b
→β a
11.6.2 if false a b = b
if false a b
(λ p. λ x. λ y. p x y) false a b
→β (false a b)
replace false with its definition
(λ x. λ y. y) a b
→β b
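These definitions translate directly into Python lambdas (the names true/false/if_ are mine; if_ avoids shadowing the keyword):

```python
# Church booleans, mirroring the definitions above.
true  = lambda x: lambda y: x          # λ x. λ y. x
false = lambda x: lambda y: y          # λ x. λ y. y
if_   = lambda p: lambda x: lambda y: p(x)(y)   # λ p. λ x. λ y. p x y

assert if_(true)("a")("b") == "a"      # if true a b = a
assert if_(false)("a")("b") == "b"     # if false a b = b
```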
11.7 Y Combinator
- defining recursive functions without define
11.7.1 Fixed Points
a fixed point of f is a value xfix that satisfies xfix = f (xfix)
In mathematics, a fixed point (also known as invariant point), of a function is an element in the domain of the function which is mapped to itself by the function. - wiki
- example: f = λ x. * x x
- fixed points: 0 and 1
- A function can have:
- no fixed points (f x = x + 1)
- a unique fixed point (f x = -x)
- a finite number of fixed points (f x = * x x)
- ∞ number of fixed points (f x = x)
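A fixed-point combinator makes "recursion without define" runnable. Since Python is applicative-order, a sketch must use the Z combinator (the strict variant of Y, with an extra η-style wrapper to delay evaluation); names are mine:

```python
# Z combinator: applicative-order fixed-point combinator.
Z = lambda f: (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

# factorial defined without any self-reference in its own body
fact = Z(lambda rec: lambda n: 1 if n == 0 else n * rec(n - 1))

assert fact(0) == 1
assert fact(5) == 120
```

The plain Y combinator, λ f. (λ x. f (x x)) (λ x. f (x x)), would loop forever here because the argument x x is evaluated before f is applied.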
12 JVM Overview
12.1 Types
int | I |
boolean | Z |
void | V |
String | Ljava/lang/String; |
String[] | [Ljava/lang/String; |
ArrayList | Ljava/util/ArrayList; |
… |
12.2 Method Signature
- '(' + Param Types + ')' + return Type
- Example: void main(String[] args) → ([Ljava/lang/String;)V
12.3 Variables
- Static: getstatic, putstatic
- Non-static (fields): getfield, putfield
- Local variables
  - stored in an array of local variables
  - accessed using a slot index
  - in a non-static method, this is stored at index 0, followed by the parameters, followed by the local variables
  - accessed using load and store
12.4 Method Invocation
- invokevirtual: for instance methods
- invokestatic: for static methods
- invokespecial: during object initialization (constructors)
13 Denotational Semantics
13.1 Semantics
- Semantics is expressed by defining:
- Semantic domain
- Semantic mapping
- M: L → S, where L is language (syntactic domain) and S is semantic domain
- In denotational semantics:
- Syntactic domain: Elements of abstract syntax
- Semantic domain: abstract mathematical objects
- Semantic functions: map elements of syntactic domain to elements of semantic domain
13.2 Ex: Simple Stack Machine
13.2.1 Abstract Syntax
Program     → Instruction
Instruction → Instruction Instruction
            | load Number
            | +
            | *
            | out
13.2.2 Syntactic Domain
- P: Program
- I: Instruction
- N: Number
13.2.3 Semantic Domain
Integer | |
Output | [Integer] |
State | [Integer] * [Integer] (Stack * Output) |
13.2.4 Semantic Functions
meaning | Program → [Integer] |
execute | Instruction → State → State |
value | Number → Integer |
- Function Definitions
execute[|load N|] (stack, output) = (value[|N|] :: stack, output)
execute[|+|] (x::y::xs, output)   = ((x + y)::xs, output)
execute[|*|] (x::y::xs, output)   = ((x * y)::xs, output)
execute[|out|] (x::xs, output)    = (xs, output @ [x])
execute[|I1 I2|]                  = execute[|I2|] . execute[|I1|]
meaning[|P|]                      = execute[|P|] ([], [])   // empty stack and output
- note: execute[|I1 I2|] = execute[|I2|] . execute[|I1|] is function composition
  - the output of execute[|I1|] is fed to execute[|I2|] as input
  - (f . g) x = f (g x)
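The semantic functions above can be executed directly: each instruction denotes a function State → State, and sequencing is composition. A Python sketch (tuple encoding of instructions is mine):

```python
# State = (stack, output); execute maps an instruction to a State -> State function.

def execute(instr):
    op, *args = instr
    def step(state):
        stack, output = state
        if op == "load":
            return ([args[0]] + stack, output)
        if op == "+":
            x, y, *rest = stack
            return ([x + y] + rest, output)
        if op == "*":
            x, y, *rest = stack
            return ([x * y] + rest, output)
        if op == "out":
            x, *rest = stack
            return (rest, output + [x])
        raise ValueError(op)
    return step

def meaning(program):
    state = ([], [])             # empty stack, empty output
    for instr in program:        # sequencing = composition of the step functions
        state = execute(instr)(state)
    return state[1]              # the program's meaning: its output

# load 2; load 3; +; out  →  output [5]
assert meaning([("load", 2), ("load", 3), ("+",), ("out",)]) == [5]
```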
13.3 Modeling Memory
Using simple imperative language
13.3.1 Abstract Syntax
Program    → Command
Command    → Command Command
           | Identifier := Expression
           | skip
Expression → Identifier
           | Number
           | Expression + Expression
13.3.2 Syntactic Domain
P | Program |
C | Command |
E | Expression |
N | Number |
Id | Identifier |
13.3.3 Semantic Domain
Integer | |
Store | Identifier → (Integer + undefined) |
- Memory is modeled as function store from identifier to values
13.3.4 Semantic functions
meaning | Program → Store |
evaluate | Expression → Store → Integer |
execute | Command → Store → Store |
value | Number → Integer |
empty | λ Id. undefined |
update | (Store, Id, Integer) → Store |
- there is no out, we'll see the effect as side effect on memory
store
- store is a function which maps each Id to Integer + undefined
- the base case for store is empty, where every identifier is mapped to undefined
  - sto0 = empty: every identifier → undefined
  - sto1: maps one id to an integer, everything else delegated to sto0
  - sto2: maps another id to an integer, everything else delegated to sto1, and so on…
update (sto_n, id, int_i) x = if (x = id) then int_i else sto_n x
(* (update (sto_n, id, val)) returns a function which maps id -> val;
   everything else is delegated to sto_n *)
evaluate
evaluate[|N|] sto       = value[|N|]
evaluate[|Id|] sto      = sto Id
evaluate[|E1 + E2|] sto = evaluate[|E1|] sto + evaluate[|E2|] sto
execute[|skip|] sto     = sto
execute[|Id := E|] sto  = update(sto, Id, evaluate[|E|] sto)
execute[|C1 C2|]        = execute[|C2|] . execute[|C1|]
meaning[|C|]            = execute[|C|] empty
13.3.5 Meaning
- the meaning of a program is a function
  - in the simple stack machine, execute was a function mapping Instruction → State → State
    - we start with an empty stack and output and end with entries in them
  - here there is no output; the meaning of the program is the final store
    - execute maps Command → Store → Store
    - we start with the empty store (the empty function) and end with a store which maps the identifiers to integers.
13.3.6 if statements
execute [|if E then C1 else C2|] sto =
    if (evaluate[|E|] sto) then execute[|C1|] sto
    else execute[|C2|] sto
13.3.7 while statements
execute [|while E do C|] sto =
    // unroll the loop
    if (evaluate[|E|] sto)
    then (execute [|while E do C|] . execute [|C|]) sto
    else sto

// the loop is a fixed point
let loop = λ s : Store .
        if (evaluate[|E|] s) then (loop (execute [|C|] s))   // ← recursively call loop
        else s
in loop end
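The whole store-based semantics, including the unrolled while rule, runs as a small Python interpreter. A sketch (the tuple encoding of commands and expressions is mine): a store is a function from identifiers to values, and update returns a new function shadowing one binding.

```python
def empty(_ident):
    return None        # stands for `undefined`

def update(sto, ident, val):
    # new store: maps `ident` to `val`, delegates everything else to `sto`
    return lambda x: val if x == ident else sto(x)

def evaluate(expr, sto):
    kind = expr[0]
    if kind == "num": return expr[1]
    if kind == "id":  return sto(expr[1])
    if kind == "+":   return evaluate(expr[1], sto) + evaluate(expr[2], sto)

def execute(cmd, sto):
    kind = cmd[0]
    if kind == "skip": return sto
    if kind == ":=":   return update(sto, cmd[1], evaluate(cmd[2], sto))
    if kind == "seq":  return execute(cmd[2], execute(cmd[1], sto))
    if kind == "if":
        return execute(cmd[2] if evaluate(cmd[1], sto) else cmd[3], sto)
    if kind == "while":
        if evaluate(cmd[1], sto):                    # unroll one iteration
            return execute(cmd, execute(cmd[2], sto))
        return sto

# x := 3; x := x + 4
prog = ("seq", (":=", "x", ("num", 3)),
               (":=", "x", ("+", ("id", "x"), ("num", 4))))
assert execute(prog, empty)("x") == 7
```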
13.3.8 Nested Scopes
- Environment and Location
Environment : identifier → location (address)
Store       : location → value
env         : identifier → (location + unbound)
store       : location → (storableType + unused + undefined)

- unused: not yet bound to an identifier
- undefined: no value stored yet
extendEnv
extendEnv (env, id0, loc) id1 = if (id0 = id1) then loc else env id1
update
update (sto, loc0, val) loc1 = if (loc1 = loc0) then val else sto loc1
14 Axiomatic Semantics
14.1 Assertions
- Assertion true is satisfied by any program state
- Assertion false is not satisfied by any program state
14.2 Axioms
14.2.1 Assignment Axiom
{ Q[V := E] and E is defined } V := E { Q }
Whatever is true about E beforehand, is true about V afterwards.
- Axioms can be used:
to calculate precondition
{ x = 0 } x := x + 1 { x = 1 }
// substitute the RHS of the assignment into the postcondition:
//   x = 1
//   x + 1 = 1
//   x = 0   ← the precondition, so the triple is valid
to calculate the RHS of statement
{ true } x := E { x = 4 }
// E must be 4
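The assignment axiom is just substitution into the postcondition. A toy sketch in Python (assertions are plain strings and substitution is textual, so this only works for simple whole-word variables; all names are mine):

```python
import re

def wp_assign(var, expr, post):
    """Weakest precondition of `var := expr` given postcondition `post`:
    replace whole-word occurrences of var in post with (expr)."""
    return re.sub(rf"\b{re.escape(var)}\b", f"({expr})", post)

# { ? } x := x + 1 { x = 1 }  →  weakest precondition (x + 1) = 1, i.e. x = 0
assert wp_assign("x", "x + 1", "x = 1") == "(x + 1) = 1"
```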
14.2.2 Stronger and Weaker assertions
- A ⊃ B
  - A implies B
  - A is stronger than B
  - A has more constraints; if A is true then B is too (the set of states satisfying A is a subset of those satisfying B).
- Eg: x > 0 ⊃ x ≥ 0
  - set of values allowed by x > 0 = {1, 2, 3, …}
  - set of values allowed by x ≥ 0 = {0, 1, 2, 3, …}
  - x > 0 has more constraints than x ≥ 0
  - x > 0 is stronger than x ≥ 0
- false is stronger than any P (false is satisfied by no state)
- any P is stronger than true (true is satisfied by every state)
- We want the precondition to be as weak as possible so that as many states as possible are valid for the statement.
- Assignment axiom gives weakest precondition for given post condition
14.2.3 Inference Rules
For an inference rule P / Q: if we know that everything above the line (P) is true, then we can conclude everything below the line (Q).
- Strengthen the precondition
P ⊃ Q, {Q} S {R}
----------------
    {P} S {R}
- P is stronger than Q
- S in a state satisfying Q will result in state satisfying R
- anywhere where P is satisfied, Q is also satisfied
- ∴ the final state will satisfy R
- this may seem counterintuitive if we think of S as a function mapping every precondition state to a postcondition state
  - S only guarantees that R holds when execution starts in some state satisfying the precondition.
- The set of states allowed by P is a subset of the states allowed by Q (since P has stronger constraints)
- Weaken the postcondition
P ⊃ R, {Q} S {P}
----------------
    {Q} S {R}
- P is stronger than R
- S gets from Q to P
- any state satisfying P will satisfy R
- ∴ S gets Q to R
- Sequence Rule
{P} S0 {Q}, {Q} S1 {R}
----------------------
    {P} S0 S1 {R}
14.2.4 Some facts
1. A and B ⊃ A
2. A ⊃ A or B
3. {false} S {P}
4. {P} S {true}
my intuition:
- "A and B" adds B's constraints on top of A's, so "A and B" is stronger than A
- any state satisfying A satisfies "A or B", but a state may satisfy "A or B" through B alone; so A has more constraints than "A or B"
- false has highest constraints
- true has no constraints
14.2.5 Application in programming languages
skip
P ⊃ Q
------------
{P} skip {Q}
if
{P and B} S0 {Q},  (P and ¬B) ⊃ Q
---------------------------------
      {P} if B then S0 {Q}
if-else
{P and B} S0 {Q},  {P and ¬B} S1 {Q}
------------------------------------
    {P} if B then S0 else S1 {Q}
note: if ≈ if-else with the else branch being skip
14.2.6 Some rules about ⊃
A ⊃ C
-----------
A and B ⊃ C

A or B ⊃ C
------------
A ⊃ C, B ⊃ C

A ⊃ C
----------
A ⊃ B or C

A ⊃ B, A ⊃ C
------------
A ⊃ B and C

A ⊃ B and C
------------
A ⊃ B, A ⊃ C

A ⊃ C, B ⊃ C
------------
A or B ⊃ C