Lexical Analyser PPTs for Third Lease Computer Sc. and Engineering

DrRajurkarArchanaMil 35 views 68 slides Sep 29, 2024
Slide 1
Slide 1 of 68
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20
Slide 21
21
Slide 22
22
Slide 23
23
Slide 24
24
Slide 25
25
Slide 26
26
Slide 27
27
Slide 28
28
Slide 29
29
Slide 30
30
Slide 31
31
Slide 32
32
Slide 33
33
Slide 34
34
Slide 35
35
Slide 36
36
Slide 37
37
Slide 38
38
Slide 39
39
Slide 40
40
Slide 41
41
Slide 42
42
Slide 43
43
Slide 44
44
Slide 45
45
Slide 46
46
Slide 47
47
Slide 48
48
Slide 49
49
Slide 50
50
Slide 51
51
Slide 52
52
Slide 53
53
Slide 54
54
Slide 55
55
Slide 56
56
Slide 57
57
Slide 58
58
Slide 59
59
Slide 60
60
Slide 61
61
Slide 62
62
Slide 63
63
Slide 64
64
Slide 65
65
Slide 66
66
Slide 67
67
Slide 68
68

About This Presentation

Lexical Analyser PPTs for Third Year Computer Science and Engineering Students


Slide Content

Lexical Analyzer Lexical Analyzer reads the source program character by character to produce tokens. Normally a lexical analyzer doesn’t return a list of tokens at one shot, it returns a token when the parser asks a token from it. Compiler Construction 1 Lexical Analyzer Parser source program token get next token Chapter 2 Symbol Table

Since the LA is part of compiler that reads the source text, it may perform certain other tasks such as, Stripping out comments and whitespaces(blank, newline, tab) Correlating error messages generated by the compiler with the source program e.g. LA may keep track of the number of newline characters seen, so it can associate a line number with each error message. In some compilers, the LA makes a copy of the source program with the error messages inserted at the appropriate positions. If the source program uses a macro-preprocessor, the expantion of macros may also be performed by the LA Sometimes, LA are divided into cascade of two processes A) Scanning B) Lexical anlysis Compiler Construction 2

Compiler Construction 3 Issues in Lexical Analysis Reasons for separating the analysis phase of compiling into lexical analysis and parsing Simpler Design Compiler Efficiency Is Improved Compiler Portability Is Enhanced

Tokens, Patterns and Lexemes Token represents a set of strings described by a pattern. Identifier represents a set of strings which start with a letter continues with letters and digits The actual string ( newval ) is called as lexeme . Tokens: identifier, number, addop , delimeter , … Since a token can represent more than one lexeme, additional information should be held for that specific lexeme. This additional information is called as the attribute of the token. For simplicity, a token may have a single attribute which holds the required information for that token. Compiler Construction 4

Compiler Construction 5 For identifiers, this attribute is a pointer to the symbol table, and the symbol table holds the actual attributes for that token. Some attributes: < id,attr > where attr is pointer to the symbol table < assgop ,_> no attribute is needed (if there is only one assignment operator) < num,val > where val is the actual value of the number. Token type and its attribute uniquely identifies a lexeme. Regular expressions are widely used to specify patterns.

Compiler Construction 6 Example E = M * C ** 2 Token names and associated attributes are <id, pointer to symbol table entry for E> < assign_op > <id, pointer to symbol table entry for M> < mult_op > <id, pointer to symbol table entry for C> < exp_op > < number,integer value 2>

Attributes of Tokens y := 31 + 28*x <id, “y”> <assign, > <num, 31> <+, > <num, 28> <*, > <id, “x”> Token Attribute Compiler Construction 7 Lexical Analyzer Parser

Compiler Construction 8 Tokens, Patterns, and Lexemes A token is a classification of lexical units – For example: id and num Lexemes are the specific character strings that make up a token – For example: abc and 123 Patterns are rules describing the set of lexemes belonging to a token – For example: “ letter followed by letters and digits” and “ non-empty sequence of digits ”

Compiler Construction 9 Lexical Errors fi ( a == f(x) )… It may be misspelling of the keyword if It may be undeclared function identifier Lexical analyzer must return the token id to the parser and some other phase handle the error Panic Mode Recovery – Delete successive characters from the remaining input, until the LA can find well formed token

Compiler Construction 10 Other possible recovery actions Delete one character from the remaining input Insert a missing character into the remaining input Replace a character by another character Transpose two adjacent characters

Compiler Construction 11 Input Buffering Three approaches to the implementation of a lexical analyzer Use of lexical analyzer generator e.g. Lex compiler Write the Lexical Analyzer in a conventional system programming languages Write the Lexical Analyzer in assembly language using I/O facilities to read the input

Compiler Construction 12 Buffer Pairs E = M * C * * 2 eof Each buffer is of the same size N, and N is usually size of a disk block Using one system read command N characters are read eof marks the end of source file Two pointers lexemeBegin and forward are maintained

Compiler Construction 13 Sentinels Each time we advance forward pointer we must check for two tests 1. One for end of buffer 2. Other to determine what character is read These two tests can be combined by using sentinel Sentinel is a special character that can be part of source program e.g. eof

Compiler Construction 14 Can we run out of buffer space? In modern languages lexemes are short and one or two character lookahead is sufficient The buffer size in thousand is ample and the double buffer scheme works without problem To avoid problem with long character strings, we can treat them as concatenation of components In JAVA a+ operator is used to represents long strings on different lines PL/I do not treat keywords as reserved Problem with DECLARE( Arg1, Arg2……

Specification of Tokens Regular expressions are an important notation for specifying lexeme patterns Compiler Construction 15

Terminology of Languages Alphabet : a finite set of symbols (ASCII characters) String : Finite sequence of symbols on an alphabet Sentence and word are also used in terms of string  is the empty string |s| is the length of string s . Language: sets of strings over some fixed alphabet  the empty set is a language. {} the set containing empty string is a language The set of well-formed C programs is a language The set of all possible identifiers is a language. Compiler Construction 16

Operations on Languages Concatenation: L 1 L 2 = { s 1 s 2 | s 1  L 1 and s 2  L 2 } Union L 1  L 2 = { s | s  L 1 or s  L 2 } Exponentiation: L = {} L 1 = L L 2 = LL Kleene Closure L * = Positive Closure L + =

Example L 1 = { a,b,c,d } L 2 = {1,2} L 1 L 2 = {a1,a2,b1,b2,c1,c2,d1,d2} L 1  L 2 = {a,b,c,d,1,2} L 1 3 = all strings with length three (using a,b,c,d } L 1 * = all strings using letters a,b,c,d and empty string L 1 + = doesn’t include the empty string Compiler Construction 18

Regular Expressions We use regular expressions to describe tokens of a programming language. A regular expression is built up of simpler regular expressions (using defining rules) Each regular expression denotes a language. A language denoted by a regular expression is called as a regular set. Compiler Construction 19

Regular Expressions (Rules) Regular expressions over alphabet  Reg. Expr Language it denotes  {} a  {a} (r 1 ) | (r 2 ) L(r 1 )  L(r 2 ) (r 1 ) (r 2 ) L(r 1 ) L(r 2 ) (r) * (L(r)) * (r) L(r) Compiler Construction 20

Regular Expressions (cont.) We may remove parentheses by using precedence rules. * highest concatenation next | lowest ab * |c means (a(b) * )|(c) Ex:  = {0,1} 0|1 => {0,1} (0|1)(0|1) => {00,01,10,11} * => { ,0,00,000,0000,....} (0|1) * => all strings with 0 and 1, including the empty string Compiler Construction 21

Regular Definitions To write regular expression for some languages can be difficult, because their regular expressions can be quite complex. In those cases, we may use regular definitions . We can give names to regular expressions, and we can use these names as symbols to define other regular expressions. A regular definition is a sequence of the definitions of the form: d 1  r 1 where d i is a distinct name and d 2  r 2 r i is a regular expression over the alphabet   {d 1 ,d 2 ,...,d i-1 } d n  r n basic symbols previously defined names Compiler Construction 22

Regular Definitions (cont.) Ex: Identifiers in Pascal letter  A | B | ... | Z | a | b | ... | z digit  0 | 1 | ... | 9 id  letter (letter | digit ) * If we try to write the regular expression representing identifiers without using regular definitions, that regular expression will be complex. (A|...| Z|a |...|z) ( (A|...| Z|a |...|z) | (0|...|9) ) * Compiler Construction 23

Regular Definitions (cont.) Ex: Unsigned numbers in Pascal digit  0 | 1 | ... | 9 digits  digit + opt-exponent  ( E ( + | - | ε ) digits ) | ε opt-fraction  ( . digits ) | ε unsigned-num  digits opt-fraction opt-exponent Compiler Construction 24

Compiler Construction 25 Problems 1. Describe the languages denoted by the following regular expressions a( a | b )*a Solution : L(r ) = { aa , aaa , aba , aaaa , aaba , abaa , abba , ..} From above we can say that the above language says that it is “ Language consisting of strings with a’s and b’s always starting and ending with a. .

Compiler Construction 26 (( ε | a) b*)* Solution: L(r ) = { ε , a, b, aa , ab , bb, abb , abab , …..} From above, we can say that the above language says that it is “Language consisting of strings with no consecutive a’s when there is combination of a’s and b’s in the string

Problems Compiler Construction 27 Write regular definitions for the following languages All strings of letters that contain the five vowels in order Non_vowel → [b-d B-D f-h F-H j-n J-N p-t P-T v-z V-Z] String → (non-vowel)*([a| A]) + (non-vowel)*([ e|E ]) + (non-vowel)*([ i | I]) + (non-vowel)*([o| O]) + (non-vowel)*([u| U]) + b) All strings of lowercase letters in which the letters are in ascending lexicographic order

Compiler Construction 28 Recognition of Tokens Consider the following grammar Stmt → if expr then stmt | if expr then stmt else stmt | ε expr → term relop term | term term → id | number

Compiler Construction 29 Patterns for Tokens digit → [0-9] digits → digit + number → digits(. Digits)? (E[+-]? Digits)? letter → [A- Za -z] id → letter (letter | digit)* if → if then → then else → else relop → < | > | <= | >= | = | <>

Transition Diagram As an intermediate step in the construction of LA patterns are first converted into stylized flowcharts called transition diagrams RE patterns are converted to TD by hand.There are mechanical way to construct these diagrams from Res TD have set of nodes, called states. Each state represents a condition that could occur during the process of scanning Edges are directed from one state of TD to another. Each edge is labeled by a symbol or set of symbols. Certain states are called accepting or final. These states indicate that a lexeme has been found. Compiler Construction 30

Compiler Construction 31 If it is necessary to retract forward pointer one position then * is placed near that accepting state. 1 2 3 4 7 8 5 6 < = > return (relop LE ) return (relop NE ) return (relop LT ) return (relop GE ) return (relop GT ) other return (relop EQ ) * = > = other * start

Recognition of reserved words and identifiers Compiler Construction 32 10 11 start letter Letter or digit other * return(getToken(), installID()) There are two ways that we can handle reserved words that look like ids Install the reserved words in the symbol table initially A field of the symbol table entry indicates that which token they represent Create separate transition diagram for each keyword. Such transition diagram consists of states representing the situation after each successive letter of the keyword is seen, followed by a test for a “ nonletter -or-digit start t h e n nonlet/digit *

Sketch of implementation of relop transition diagram TOKEN getRelop () { Token retToken = new(RELOP); while(1) { /* repeat charater processing until a return or failure occurs*/ switch(state){ case 0: c = nextChar (); if (c == ‘<‘ ) state =1; else if ( c == ‘=‘ ) state = 5; else if ( c == ‘>’ ) state = 6; else fail (); /* lexeme is not a relop */ break; case 1: … case 2: … case 8: retract(); retToken.attribute = GT; return( retToken ); } } } Compiler Construction 33

Finite Automata A recognizer for a language is a program that takes a string x, and answers “yes” if x is a sentence of that language, and “no” otherwise. We call the recognizer of the tokens as a finite automaton . A finite automaton can be: deterministic(DFA) or non-deterministic (NFA) This means that we may use a deterministic or non-deterministic automaton as a lexical analyzer Compiler Construction 34

Both deterministic and non-deterministic finite automaton recognize regular sets. Which one? deterministic – faster recognizer, but it may take more space non-deterministic – slower, but it may take less space Deterministic automatons are widely used lexical analyzers. First, we define regular expressions for tokens; Then we convert them into a DFA to get a lexical analyzer for our tokens. Algorithm1: Regular Expression  NFA  DFA (two steps: first to NFA, then to DFA) Algorithm2: Regular Expression  DFA (directly convert a regular expression into a DFA) Compiler Construction 35

Non-Deterministic Finite Automaton (NFA) A non-deterministic finite automaton (NFA) is a mathematical model that consists of: S - a set of states  - a set of input symbols (alphabet) move – a transition function move to map state-symbol pairs to sets of states. s - a start (initial) state F – a set of accepting states (final states) Compiler Construction 36

- transitions are allowed in NFAs. In other words, we can move from one state to another one without consuming any symbol A NFA accepts a string x, if and only if there is a path from the starting state to one of accepting states such that edge labels along this path spell out x Compiler Construction 37

NFA (Example) Compiler Construction 38 1 2 a b start a b 0 is the start state s {2} is the set of final states F = { a,b } S = {0,1,2} Transition Function: a b 0 {0,1} {0} 1 _ {2} 2 _ _ Transition graph of the NFA The language recognized by this NFA is (a|b) * a b

Deterministic Finite Automaton (DFA) Compiler Construction 39 It is a special form of a NFA no state has - transition for each symbol a and state s, there is at most one labeled edge a leaving s. i.e. transition function is from pair of state-symbol to state (not set of states) 1 2 b a a b The language recognized by this DFA is also (a|b) * a b b a

Implementing a DFA Le us assume that the end of a string is marked with a special symbol (say eos ). The algorithm for recognition will be as follows: (an efficient implementation) s  s { start from the initial state } c  nextchar { get the next character from the input string } while (c != eos ) do { do until the end of the string } begin s  move( s,c ) { transition function } c  nextchar end if (s in F) then { if s is an accepting state } return “yes” else return “no” Compiler Construction 40

Implementing a NFA S   -closure({s }) { set all of states can be accessible from s by -transitions } c  nextchar while (c != eos ) { begin s   -closure(move( S,c )) { set of all states can be accessible from a state in S c  nextchar by a transition on c } end if (S F != ) then { if S contains an accepting state } return “yes” else return “no” This algorithm is not efficient. Compiler Construction 41

Converting A Regular Expression into A NFA (Thomson’s Construction) This is one way to convert a regular expression into a NFA. There can be other ways (much efficient) for the conversion. Thomson’s Construction is simple and systematic method. It guarantees that the resulting NFA will have exactly one final state, and one start state. Construction starts from simplest parts (alphabet symbols). To create a NFA for a complex regular expression, NFAs of its sub-expressions are combined to create its NFA, Compiler Construction 42

Thomson’s Construction (cont.) Compiler Construction 43 To recognize an empty string  To recognize a symbol a in the alphabet  If N(r 1 ) and N(r 2 ) are NFAs for regular expressions r 1 and r 2 For regular expression r 1 | r 2 a f i f i  N(r 2 ) N(r 1 ) f i NFA for r 1 | r 2    

Thomson’s Construction (cont.) Compiler Construction 44 For regular expression r 1 r 2 i f N(r 2 ) N(r 1 ) NFA for r 1 r 2 Final state of N(r 2 ) become final state of N(r 1 r 2 ) For regular expression r * N(r) i f NFA for r *    

Thomson’s Construction (Example - (a|b) * a ) Compiler Construction 45 a: a b b: (a | b) a b     b     a    (a|b) *   b     a    a (a|b) * a

Converting a NFA into a DFA (subset construction) The general idea behind the subset construction is that each state of DFA corresponds to a set of NFA states After reading input a1a2….an, the DFA is in that state which correponds to the set of states that the NFA can reach, from its start state It is possible that the number of DFA states is exponential in the number of NFA states In practice the NFA and DFA have approximately the same number of states Compiler Construction 46

Converting a NFA into a DFA (subset construction) The subset construction algorithm performs three operations on NFA states  -closure(s) – Set of NFA states reachable from NFA state s on -transitions alone  -closure(T) – Set of NFA states reachable from some NFA state s in set T on -transitions alone move( T,a ) – Set of NFA states to which there is a transition on input symbol a from some state s in T Compiler Construction 47

Converting a NFA into a DFA (subset construction) T is set of states of NFA put T=  -closure({s }) as an unmarked state into the set of DFA (DS) while (there is one unmarked T in DS) do begin mark T for each input symbol a do begin U   -closure(move( T,a )) if (U is not in DS) then add U into DS as an unmarked state Dtran [ T,a ]  U end end a state S in DS is an accepting state of DFA if a state in S is an accepting state of NFA the start state of DFA is  -closure({s }) Compiler Construction 48 set of states to which there is a transition on a from a state s in S 1  -closure({s }) is the set of all states can be accessible from s by - transition.

Converting a NFA into a DFA (Example) S0 =  -closure({0}) = {0,1,2,4,7} S0 into DS as an unmarked state  mark S0  -closure(move(S0,a)) =  -closure({3,8}) = {1,2,3,4,6,7,8} = S1 S1 into DS  -closure(move(S0,b)) =  -closure({5}) = {1,2,4,5,6,7} = S2 S2 into DS Dtran [S0,a]  S1 transfunc [S0,b]  S2  mark S1  -closure(move(S1,a)) =  -closure({3,8}) = {1,2,3,4,6,7,8} = S1  -closure(move(S1,b)) =  -closure({5}) = {1,2,4,5,6,7} = S2 b     a    a

Converting a NFA into a DFA (Example) contd... Dtran [S1,a]  S1 Dtran [S1,b]  S2  mark S2  -closure(move(S2,a)) =  -closure({3,8}) = {1,2,3,4,6,7,8} = S1  -closure(move(S2,b)) =  -closure({5}) = {1,2,4,5,6,7} = S2 Dtran [S2,a]  S1 Dtran [S2,b]  S2 Compiler Construction 50

Converting a NFA into a DFA (Example – cont.) Compiler Construction 51 S is the start state of DFA since 0 is a member of S ={0,1,2,4,7} S 1 is an accepting state of DFA since 8 is a member of S 1 = {1,2,3,4,6,7,8} b a a b b a S 1 S 2 S Transition Table for DFA D NFA State DFA State a b {0,1,2,4,7} S0 S1 S2 (1,2,3,4,6,7,8} S1 S1 S2 {1,2,4,5,6,7} S2 S1 S2

A Language for Specifying Lexical Analyzers Several tools have been built for constructing LA from Res Lex , has been widely used to specify Las for a variety of languages The tool is referred as Lex compiler and its input specifications is called Lex language First, a specification of a LA is prepared by creating a program lex.1 in the lex language Compiler Construction 52

Compiler Construction 53 Then lex.1 is run through the Lex compiler to produce C program lex.yy.c lex.yy.c consists of a tabular representation of a transition diagram constructed from the RE of lex.1 The actions associated with RE in lex.1 are pieces of C code and are carried over diretly to lex.yy.c Finally, lex.yy.c is run through the C compiler to produce an object program a.out .

Compiler Construction 54 Lex source program lex.1 Lex compiler C compiler a.out Lex.yy.c a.out Lex.yy.c Input stream Sequence of tokens Creating Lexical Analyzer with Lex

Lex Specifications Lex program consists of three parts declarations { includes declarations of variables, %% constants and regular definitions} translation rules %% auxiliary procedures { holds procedures needed by the actions} The translation rules of a Lex program are of the form p 1 {action 1 } p 2 {action 2 } Compiler Construction 55

Design of Lexical Analyzer Generator Design of a S/W tool that automatically constructs a lexical analyzer from a program in the Lex language. Specification of a LA of the form p 1 {action 1 } p 2 {action 2 } .. p n {action n } p i is Regular Expression and action i is a program fragment Compiler Construction 56

Our problem is to construct a recognizer that looks for lexemes in the input buffer If more that one pattern matches, the recognizer is to choose the longest lexeme matched If there are two or more patterns that match the longest lexeme, the first listed matching pattern is chosen Compiler Construction 57 Design of Lexical Analyzer Generator (cont.)

Lex Specification Transition table input buffer Compiler Construction 58 Model of Lex Compiler Lex Compiler lexeme FA Simulator Transition Table Schematic Lexical Analyzer

Pattern Matching Based on NFA’s Construct the transition table of NFA N for the composite pattern p 1 | p 2 | p 3 …… | p n Create an NFA N(p i ) for each pattern p i and link it to S the start state Compiler Construction 59 N(p 1 ) N(p 2 ) N( p n ) S

Example We have the following Lex program consisting of three regular expressions a { } /* actions are omitted here */ abb { } a*b + { } Compiler Construction 60 start a 1 2 start a b b 3 4 5 1 6 start b 7 8 a b NFA for a, abb , and a*b

Example Cont. Compiler Construction 61 a a b b 1 2 3 4 5 6 b 7 8 a b ε ε ε Combined NFA recognizing three different patterns 1 3 7 a a b a 2 4 7 7 8 Sequence of sets of states entered in processing input aaba A*b +

Pattern Matching Based on DFA Another approach to the construction of lexical analyzer from Lex specification is to use a DFA When we convert an NFA to DFA using subset construction algorithm, there may be several accepting states in given subset of NFA states In such situation, the accepting state corrosponding to the pattern listed first has priority Following is the transition table of converted DFA Compiler Construction 62

Pattern Matching Based on DFA STATE INPUT SYMBOL a b PATTEN ANNOUNCED 0137 247 8 none 247 7 58 a 8 - 8 a*b + 7 7 8 None 58 - 68 a*b + 68 - 8 abb Compiler Construction 63 Test for strings aaba and aba

Example The input string is aba The DFA starts off in state 0137 On input a it goes to state 247 Then on input b it goes to state 58 On next input a it has no next state We thus have reached termination The last of these includes the accepting NFA state 8 In state 58 the DFA announces that the pattern a*b + has been recognized and selects ab the prefix of input as lexeme. Compiler Construction 64

Minimizing Number of States of a DFA partition the set of states into two groups: G 1 : set of accepting states G 2 : set of non-accepting states For each new group G partition G into subgroups such that states s 1 and s 2 are in the same group if for all input symbols a, states s 1 and s 2 have transitions to states in the same group. Start state of the minimized DFA is the group containing the start state of the original DFA. Accepting states of the minimized DFA are the groups containing the accepting states of the original DFA. Compiler Construction 65

Algorithm : Minimizing the number of states of a DFA Input : A DFA D with set of states S, input alphabet Σ , start state s 0, and set of accepting states F’ Output : A DFA D’ accepting the same language as D and having as few states as possible Method : Start with an initial partition π with two groups, F and S – F, the accepting and nonaccepting states of D Apply following procedure to construct new partition π new initially, π new = π; for (each group G of π) { partition G into subgroups such that Compiler Construction 66

Minimizing DFA - Example Compiler Construction 67 b a a a b b 3 2 1 G 1 = {2} G 2 = {1,3} G 2 cannot be partitioned because move(1,a)=2 move(1,b)=3 move(3,a)=2 move(2,b)=3 So, the minimized DFA (with minimum states) {1,3} a a b b {2}

Minimizing DFA – Another Example Compiler Construction 68 b b b a a a a b 4 3 2 1 Groups: {1,2,3} {4} a b 1->2 1->3 2->2 2->3 3->4 3->3 {1,2} {3} no more partitioning So, the minimized DFA {1,2} {4} {3} b a a a b b
Tags