Lexical analysis - Compiler Design


About This Presentation

Role of the lexical analyzer, specification of tokens, recognition of tokens, lexical analyzer generator, finite automata, and design of a lexical analyzer generator.


Slide Content

Lexical Analyzer / Scanner

Outline
•Role of lexical analyzer
•Specification of tokens
•Recognition of tokens
•Lexical analyzer generator
•Finite automata
•Design of lexical analyzer generator

Lexical analyzer terminologies
Lexeme
Lexemes are the words derived from the character input stream.
A lexeme is a particular instance of a token.
E.g. cout << 3+2-3;
Processing by scanner: {cout} | space | {<<} | space | {3} {+} {2} {-} {3} {;}
Lexemes: cout, <<, 3, +, 2, -, 3, ;
Token
A token is a group of characters having a collective meaning.
Tokens are lexemes mapped into a token-name and an attribute-value.
Syntax: <token-name, attribute-value>   // the attribute value is optional
Tokens: <id1, cout> <operator1, <<> <number1, 3> <operator, +>
<number2, 2> <operator2, -> <number3, 3> <punctuator, ;>
Pattern
A pattern is a rule that describes how a token can be formed.
E.g.
identifier: ([a-z] | [A-Z]) ([a-z] | [A-Z] | [0-9])*
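As a small illustration (not from the slides), a hand-written C check of this identifier pattern could look like the sketch below; matches_identifier is an assumed name:

#include <ctype.h>
/* returns 1 if s matches ([a-z]|[A-Z])([a-z]|[A-Z]|[0-9])*, else 0 */
int matches_identifier(const char *s)
{
    if (!isalpha((unsigned char)*s))          /* first character must be a letter */
        return 0;
    for (s++; *s != '\0'; s++)
        if (!isalnum((unsigned char)*s))      /* the rest: letters or digits only */
            return 0;
    return 1;
}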

Lexical analyzer
The scanner analyses the source program by reading it character by character,
then grouping the characters into individual words and symbols (tokens).
It removes comment lines and whitespace (blank space, tab, newline, etc.).
Identifiers and values are stored in the symbol table.
Types of tokens: constant, identifier, keyword, operator and symbol.
•E.g.
const pi = 3.14159;
Token 1: (const, -)
Token 2: (identifier, 'pi')
Token 3: (=, -)
Token 4: (realnumber, 3.14159)
Token 5: (; , -)

Role of Lexical analyzer
[Diagram: the source program flows into the Lexical Analyzer, which returns a token each time the Parser calls getNextToken(); the Parser passes its result on to semantic analysis, and both phases consult the Symbol table.]
•If the LA identifies a lexeme constituting an identifier, the lexeme needs to be stored in the symbol table.
•Sometimes the type of an identifier may be read from the symbol table.
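As a rough illustration of the interface in the diagram (not from the slides), the parser pulls tokens from the LA one at a time; Token, getNextToken() and the token codes below are assumed names, and the scanner is stubbed out:

#include <stdio.h>

enum { TOK_EOF = 0, TOK_ID, TOK_NUMBER };
typedef struct { int name; int attribute; } Token;

/* stub scanner: a real one would read the source program character by character */
static Token getNextToken(void)
{
    static int calls = 0;
    Token t;
    t.name = (calls++ == 0) ? TOK_ID : TOK_EOF;
    t.attribute = 1;                 /* e.g. index of the identifier in the symbol table */
    return t;
}

int main(void)                       /* the parser would sit here, calling getNextToken() on demand */
{
    for (Token t = getNextToken(); t.name != TOK_EOF; t = getNextToken())
        printf("token %d, attribute %d\n", t.name, t.attribute);
    return 0;
}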

Why do we separate lexical analysis and parsing?
•The lexical analyzer need not be an individual phase.
•But having a separate phase simplifies the design and improves the compiler's efficiency and portability.
Example of tokens
Token        Informal description                      Sample lexemes
if           characters i, f                           if
else         characters e, l, s, e                     else
comparison   < or > or <= or >= or == or !=            <=, !=
id           letter followed by letters and digits     pi, score, D2
number       any numeric constant                      3.14159, 0, 6.02e23
literal      anything but " surrounded by "            "core dumped", "(", "{", ",", "}", "[", ";", etc.

Tokens, their patterns, and attribute values
ws – whitespace (\t, \n, blank space, etc.)
relop – relational operator

Example: Find the tokens & lexemes
1. printf("total = %d\n", score);
Token     Informal description                      Lexemes
id        letter followed by letters and digits     printf, score
literal   anything but " surrounded by "            "total = %d\n"
2. Find the tokens & lexemes for E = M*C^2

Attributes for tokens
•E = M * C ^2
<id, pointer to symbol table entry for E>
<assign-op>
<id, pointer to symbol table entry for M>
<mult-op>
<id, pointer to symbol table entry for C>
<exp-op>
<number, integer value 2>
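One possible C representation of these <token-name, attribute-value> pairs is sketched below; the type names (TokenName, SymEntry, Token) and fields are illustrative assumptions, not from the slides:

typedef enum { ID, ASSIGN_OP, MULT_OP, EXP_OP, NUMBER } TokenName;
typedef struct { char *lexeme; char *type; } SymEntry;   /* symbol-table entry */
typedef struct {
    TokenName name;
    union {
        SymEntry *entry;   /* for ID: pointer to the symbol-table entry */
        int value;         /* for NUMBER: the integer value, e.g. 2 */
    } attr;
} Token;

/* token stream for  E = M * C ^ 2  */
SymEntry E = {"E", "id"}, M = {"M", "id"}, C = {"C", "id"};
Token stream[] = {
    { ID, {.entry = &E} }, { ASSIGN_OP, {0} },
    { ID, {.entry = &M} }, { MULT_OP,  {0} },
    { ID, {.entry = &C} }, { EXP_OP,   {0} },
    { NUMBER, {.value = 2} },
};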

RE terminology:
Alphabet: a finite set of symbols.
E.g. ∑ = {a, b, c}
A string is a finite sequence of symbols over an alphabet ∑
(sometimes a string is also called a sentence or a word).
A language is a set of strings over an alphabet ∑.
Tokens are specified effectively by regular expressions.

Regular expressions
•Ɛ is a regular expression, L(Ɛ) = {Ɛ}
•If a is a symbol in ∑ then a is a regular expression, L(a) = {a}
•Operations on languages (a set):
•Union
(r) | (s) is a regular expression denoting the language L(r) ∪L(s)
•Concatenation
(r) . (s) is a regular expression denoting the language L(r) . L(s)
•Kleene closure
(r)* is a regular expression denoting (L(r))*, where
(L(r))* = L(r)^0 ∪ L(r)^1 ∪ L(r)^2 ∪ …  (union over all i ≥ 0)
•Positive closure
(r)+ is a regular expression denoting (L(r))+, where
(L(r))+ = L(r)^1 ∪ L(r)^2 ∪ L(r)^3 ∪ …  (union over all i ≥ 1)

Regular definition (Pattern)
•Regular expressions are used to represent the patterns for the tokens.
•The regular definition for C identifiers (letters, digits, and underscores):
letter → A | B | C | … | Z | a | b | … | z | _
digit → 0 | 1 | 2 | … | 9
id → letter ( letter | digit )*
•The regular definition for unsigned numbers (integer or floating point)
such as 5280, 0.01234, 6.336E4, or 1.89E-4:
digit → 0 | 1 | 2 | … | 9
digits → digit digit*
optionalFraction → . digits | Ɛ
optionalExponent → ( E ( + | - | Ɛ ) digits ) | Ɛ
number → digits optionalFraction optionalExponent
•More examples: integer constants, string constants, reserved words, operators,
real constants.
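A quick way to sanity-check the number definition above (a sketch, not from the slides) is to write it as a POSIX extended regular expression in C; the ^ and $ anchors force the whole string to match:

#include <regex.h>
#include <stdio.h>

int main(void)
{
    const char *pattern = "^[0-9]+(\\.[0-9]+)?(E[+-]?[0-9]+)?$";
    const char *samples[] = { "5280", "0.01234", "6.336E4", "1.89E-4", "abc" };
    regex_t re;

    regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);
    for (int i = 0; i < (int)(sizeof samples / sizeof samples[0]); i++)
        printf("%-8s %s\n", samples[i],
               regexec(&re, samples[i], 0, NULL, 0) == 0 ? "number" : "no match");
    regfree(&re);
    return 0;
}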

RECOGNITION OF TOKENS
•Tokens are recognized by state transition diagrams.
•Recognition of identifiers
•Recognition of Whitespace (delimiters)
•Recognition of Relational operator
•Recognition of Keywords
•Recognition of Numbers (integer / float)

RECOGNITION OF TOKENS
Given the grammar of a branching statement (if-then-else with relational expressions),
the patterns for whitespace (ws), keywords & operators are:
ws → (blank | tab | newline)+
if → if
then → then
else → else
relop → < | > | <= | >= | = | <>

Transition diagrams
•Transition diagram for relop
[Diagram: on < the next character selects <=, <> or < (other, i.e. not >, =); on > it selects >= or > (other, i.e. not =); = is recognized directly. Accepting states reached on "other" are marked *, where * indicates input retraction (pull back).]
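A C sketch of how this relop diagram can be coded (the names and the use of ungetc() for retraction are illustrative, not from the slides):

#include <stdio.h>

typedef enum { LT, LE, NE, EQ, GT, GE, NO_RELOP } Relop;

Relop scan_relop(FILE *in)
{
    int c = fgetc(in);
    if (c == '<') {
        int d = fgetc(in);
        if (d == '=') return LE;             /* <=               */
        if (d == '>') return NE;             /* <>               */
        ungetc(d, in);                       /* * : retract, token is <  */
        return LT;
    }
    if (c == '>') {
        int d = fgetc(in);
        if (d == '=') return GE;             /* >=               */
        ungetc(d, in);                       /* * : retract, token is >  */
        return GT;
    }
    if (c == '=') return EQ;
    ungetc(c, in);                           /* not a relational operator */
    return NO_RELOP;
}

int main(void)                               /* e.g. echo "<=" | ./a.out  */
{
    printf("%d\n", scan_relop(stdin));
    return 0;
}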

Transition diagrams (cont.)
•Transition diagram for reserved words and identifiers
[Diagram: a letter, then a loop on letter or digit; the edge "other" (i.e. not a letter or digit) leads to an accepting state with input retraction.]
•Transition diagram for whitespace
[Diagram: a loop on delimiters; the edge "other" (i.e. not a delimiter) leads to an accepting state with input retraction.]
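The usual way to make the reserved-word/identifier diagram serve both purposes is to look the finished lexeme up in a table of reserved words first; a minimal C sketch (the names are assumed, not from the slides):

#include <string.h>

static const char *reserved[] = { "if", "then", "else" };

/* returns the keyword's index, or -1 if the lexeme is an ordinary identifier */
int lookup_reserved(const char *lexeme)
{
    for (int i = 0; i < (int)(sizeof reserved / sizeof reserved[0]); i++)
        if (strcmp(lexeme, reserved[i]) == 0)
            return i;
    return -1;
}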

Transition diagrams (cont.)
•Transition diagram for unsigned numbers
[Diagram: digits, an optional fraction, and an optional exponent; edges labelled "other" (i.e. not a digit) lead to accepting states with input retraction.]

Major Data Structures in a Compiler

Data Structure for Communication among Phases
•TOKENS
•A scanner collects characters into a token, as a value of an enumerated data type for tokens.
•It also preserves the string of characters or other derived information, such as the name of an identifier or the value of a number token.
•A single global variable or an array of tokens may be used.
•SYNTAX TREE
•A standard pointer-based structure generated by the parser.
•Each node represents information collected by the parser, which may be dynamically allocated or stored in the symbol table.
•The nodes require different attributes depending on the kind of language structure, which may be represented as a variant record.

Data Structure for Communication among Phases (cont.)
•SYMBOL TABLE
•Keeps information associated with identifiers: functions, variables, constants, and data types.
•Interacts with almost every phase of the compiler.
•Access operations need to be constant-time.
•One or several hash tables are often used.
•LITERAL TABLE
•Stores constants and strings, reducing the size of the program.
•Quick insertion and lookup are essential.
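The slides only say that hash tables are typically used; as an illustration, a minimal chained-hash symbol table in C might look like the sketch below (the layout and names are assumptions):

#include <stdlib.h>
#include <string.h>

#define NBUCKETS 211

typedef struct Sym {
    char *name;               /* identifier */
    const char *type;         /* e.g. "int", "float", "function" */
    struct Sym *next;         /* chaining for collisions */
} Sym;

static Sym *buckets[NBUCKETS];

static unsigned hash(const char *s)
{
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % NBUCKETS;
}

Sym *symtab_lookup(const char *name)
{
    for (Sym *p = buckets[hash(name)]; p; p = p->next)
        if (strcmp(p->name, name) == 0) return p;
    return NULL;
}

Sym *symtab_insert(const char *name, const char *type)
{
    Sym *p = symtab_lookup(name);
    if (p) return p;                       /* already present: constant-time on average */
    p = malloc(sizeof *p);
    p->name = malloc(strlen(name) + 1);
    strcpy(p->name, name);
    p->type = type;
    unsigned h = hash(name);
    p->next = buckets[h];
    buckets[h] = p;
    return p;
}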

Data Structure for Communication among Phases (cont.)
•INTERMEDIATE CODE
•Kept as an array of text strings, a temporary text file, or a linked list of structures, depending on the kind of intermediate code (e.g. three-address code or p-code (Pascal code)).
•Should be easy to reorganize.
•TEMPORARY FILES
•Hold the products of intermediate steps during compiling.
•They address the problem of memory constraints and of back-patching addresses during code generation.

Bootstrapping in Compiler Design
•Bootstrapping is the process of translating a simple language into a more complicated program, which in turn may handle a more complicated program.
•This complicated program can further handle an even more complicated program, and so on.
•A full bootstrap is necessary when building a new compiler from scratch.
•Writing a compiler for any high-level language is a complicated process.
•It takes more time to write a compiler from scratch.
•Hence a simple language is used to generate the target code in some stages (phases).

Tombstone (T) Diagram
•A T-diagram contains a set of "puzzle pieces" that represent compilers and other related language-processing programs.
[Diagram pieces: program P implemented in L; machine M implemented in hardware; a language interpreter implemented in L; a translator from S to T implemented in L.]
S – language the compiler accepts as input
T – language output by the compiler
L – language the compiler is written in

Tombstone diagram: Combination rules
[Diagram: a program P implemented in M placed on machine M – ok; a program P implemented in L placed directly on machine M – wrong; a translator from S to T implemented in M placed on a machine other than M – wrong; a program P in S passed through an S-to-T translator implemented in M and running on machine M yields P in T – ok. The implementation language of a piece and the machine it runs on must be the same.]

Example: Bootstrapping in Compiler Design
•Write a cross compiler for a new language X. Let the implementation language of this compiler be Y and the target code being generated be in language Z. That is, we create XYZ (a compiler from X to Z written in Y).
•Now, if an existing compiler for Y runs on machine M and generates code for M, it is denoted as YMM.
[T-diagram: Compiler 1 is X → Z implemented in Y; the existing compiler is Y → M implemented in (and running on) M.]

Example: Bootstrapping in Compiler Design (cont.)
•Now, if we run XYZ using YMM, we get a compiler XMZ: a compiler for source language X that generates target code in language Z and runs on machine M.
[T-diagram: Compiler 1 (X → Z in Y) fed through Y → M on machine M yields Compiler 2, the cross compiler X → Z in M.]
•This cross compiler can be used for bootstrapping on machine M, but we do not want to rely on it permanently.

Bootstrap to improve efficiency
The efficiency of programs and compilers:
Efficiency of programs:
-memory usage
-runtime
Efficiency of compilers:
-Efficiency of the compiler itself
-Efficiency of the emitted code

Lexical Analyzer Generator -Lex
Lex is a language (tool) for specifying a lexical analyzer.
The LA is specified by means of regular expressions.
Regular expressions are used to represent the patterns for the tokens.

Create Lexical Analyzer Generator - Lex
Lex source program (lex.l) → Lexical compiler (Lex) → lex.yy.c
lex.yy.c → C compiler → a.out (object code)
Input stream → a.out → sequence of tokens
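On a typical Unix system this pipeline corresponds roughly to running lex lex.l (or flex lex.l) to produce lex.yy.c, and then cc lex.yy.c -ll to produce a.out; the exact command names depend on the tools installed.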

Structure of Lex programs
declarations
%%            (denotes the start of the translation rules)
translation rules        (each of the form  Pattern {Action})
%%            (denotes the end of the translation rules)
auxiliary functions
A Lex program contains 3 sections:
1. Declarations  2. Translation rules  3. Auxiliary functions

Lex : Example
1. Declaration section
i) Declare identifiers and constants:
%{
Declare identifiers, constants, and definitions of manifest constants (LT, LE, EQ, NE, GT, GE, IF, THEN, ELSE, ID, NUMBER, RELOP) here
%}
E.g.:
%{
int a, b;
float s;
%}

Declaration section (cont'd)
ii) Declare regular definitions (written outside the %{ … %} brackets)
/* regular definitions */
delim    [ \t\n]
ws       {delim}+
letter   [A-Za-z]
digit    [0-9]
id       {letter}({letter}|{digit})*
number   {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
? denotes that the preceding item occurs 0 or 1 time

2. Translation rules section
•It starts with %% and ends with %%.
•Syntax: each rule is a pattern followed by an action
%%
Pattern1   {Action1}
Pattern2   {Action2}
………
Pattern n  {Action n}
%%
A pattern is a regular expression and an action is one or more C statements.
E.g.
%%
{ws}      {/* no action and no return */}
if        {return(IF);}
then      {return(THEN);}
else      {return(ELSE);}
{id}      {yylval = (int) installID(); return(ID);}
{number}  {yylval = (int) installNum(); return(NUMBER);}
%%
int installID()
{ /* function to install the lexeme, whose first character is pointed to by yytext and whose length is yyleng, into the symbol table, and return a pointer thereto */
}
int installNum()
{ /* similar to installID, but puts numerical constants into a separate table */
}
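As an illustration only (not from the slides), installID() could be fleshed out roughly as below, returning a table index rather than a pointer; yytext and yyleng are provided by the generated scanner, and string.h/stdlib.h would be included in the %{ … %} declarations:

#define MAXSYMS 1024
static char *symtab[MAXSYMS];
static int   nsyms = 0;

int installID(void)
{
    for (int i = 0; i < nsyms; i++)          /* lexeme already installed? */
        if (strcmp(symtab[i], yytext) == 0)
            return i;
    symtab[nsyms] = malloc(yyleng + 1);      /* copy the current lexeme */
    memcpy(symtab[nsyms], yytext, yyleng);
    symtab[nsyms][yyleng] = '\0';
    return nsyms++;
}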

3. Auxiliary function section
All required functions are defined in this section.
The example above uses two such functions, installID() and installNum(), already shown with the translation rules; they are repeated here for completeness.
E.g.
Some user-defined functions:
void add()
{
// statements
}
void mul()
{
// statements
}
int installID()
{ /* function to install the lexeme, whose first character is pointed to by yytext and whose length is yyleng, into the symbol table, and return a pointer thereto */
}
int installNum()
{ /* similar to installID, but puts numerical constants into a separate table */
}

Example - Lex program to recognize tokens:
Identify keywords, relational operators, and numbers
%{
/* no variable declarations */
%}
letter  [A-Za-z]
digit   [0-9]
id      {letter}({letter}|{digit})*
number  {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
%%
if      {printf("%s is a keyword", yytext);}        // keyword rules must precede {id}, or "if"/"else" would match as identifiers
else    {printf("%s is a keyword", yytext);}
{id}    {printf("%s is an identifier", yytext);}    // yytext holds the matched lexeme; {} around a single action is optional
"<"     {printf("%s is the less-than operator", yytext);}

Lex program to recognize tokens (cont'd)
">"      {printf("%s is the greater-than operator", yytext);}
"<="     {printf("%s is the less-than-or-equal operator", yytext);}
{number} {printf("%s is a number", yytext);}
%%
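The slide stops after the second %%; to turn this into a complete, runnable Lex source file, an auxiliary section roughly like the following is typically appended (yywrap() and main() are standard boilerplate, not part of the slide):

int yywrap(void) { return 1; }     /* report end of input: no further files to scan */

int main(void)
{
    yylex();                       /* scan stdin and print the messages above */
    return 0;
}

It can then be built with, for example, lex tokens.l followed by cc lex.yy.c (tokens.l is a hypothetical file name).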

Reference
•A. V. Aho, M. S. Lam, R. Sethi, J. D. Ullman, Compilers: Principles, Techniques and Tools, Pearson Education, 2013.
•lex & yacc – John R. Levine, Tony Mason, Doug Brown, O'Reilly.