Role of Lexical Analyzer, Specify the tokens, Recognize the tokens, Lexical analyzer generator, Finite automata,
Design of lexical analyzer generator
Size: 1.02 MB
Language: en
Added: Mar 08, 2021
Slides: 37 pages
Slide Content
Lexical Analyzer / Scanner
P. Kuppusamy -Lexical Analyzer
Outline
•Role of lexical analyzer
•Specification of tokens
•Recognition of tokens
•Lexical analyzer generator
•Finite automata
•Design of lexical analyzer generator
P. Kuppusamy -Lexical Analyzer
Lexical analyzer terminologies
Lexeme
Lexemesarethewordsderivedfromthecharacterinputstream.
Itisaparticularinstantofatoken.
Ex. cout<< 3+2-3;
Processing By Scanner : {cout} | space | {<<} | space | {3} {+} {2} {-} {3} {;}
Lexemes: cout, <<, 3, +, 2, +, 3, ;
Token
Groupofcharactershavingacollectivemeaning.
Tokensarelexemesmappedintoatoken-nameandanattribute-value
Syntax:.<identifier,lexeme> //lexemeisoptional
Tokens:<id1,cout> <operator1,<<><number1,3><operator,+>
<number2,2><operator2,-><number3,3><punctuator,;>
Pattern:
Rule that describes how a token can be formed.
E.g.
identifier: ([a-z] | [A-Z]) ([a-z]|[A-Z]|[0-9])*
P. Kuppusamy -Lexical Analyzer
Lexical analyzer
The scanner analyses the source program by reading character by character.
Then grouping the characters into individual words and symbols (tokens).
Remove the comment lines and white spaces (blank space, tab, newline, etc)
Identifiers and values are stored in symbol table.
Types of Tokens: constant, identifier, keyword, operator and symbol.
•E.g.
constpi = 3.14159;
Token 1: (const, -)
Token 2: (identifier, ‘pi’)
Token 3: (=, -)
Token 4: (realnumber, 3.14159)
Token 5: (; , -)
P. Kuppusamy -Lexical Analyzer
Role of Lexical analyzer
Lexical
Analyzer
Parser
Source
program
token
getNextToken()
Symbol
table
To semantic
analysis
•If LA identifies the lexeme constituting an identifier, lexeme needs
to store in the symbol table.
•Sometime type of identifier may read from symbol table.
P. Kuppusamy -Lexical Analyzer
WhydoseparateLexicalanalysisandparsing
•Lexicalanalyzerneednottobeanindividualphase.
•Buthavingaseparatephasesimplifiesthedesignand
improvesthecompilerefficiencyandportability.
ExampleofTokens
TokenInformaldescriptionSamplelexemes
if
else
comparison
id
number
literal
Charactersi,f
Characterse,l,s,e
<or>or<=or>=or==or!=
Letterfollowedbyletteranddigits
Anynumericconstant
Anythingbut“sorroundedby“
if
else
<=,!=
pi,score,D2
3.14159,0,6.02e23
“coredumped”,“(“,“{“,
“,”,“}”,“[“,“;”,etc.,
P. Kuppusamy -Lexical Analyzer
Tokens, their patterns, and attribute values
ws–whitespace (\t, \n, blank space, etc.)
relop–relational operator
P. Kuppusamy -Lexical Analyzer
Example: Find the tokens & lexemes
printf(“total = %d\n”, score);
TokenInformal descriptionLexemes
id
literal
Letter followed by letter and digits
Anything but “ sorrounded by “
printf, score
“total = %d\n”
“, ”
2. Find the tokens & lexemes for E = M*C^2
P. Kuppusamy -Lexical Analyzer
Attributes for tokens
•E = M * C ^2
<id, pointer to symbol table entry for E>
<assign-op>
<id, pointer to symbol table entry for M>
<mult-op>
<id, pointer to symbol table entry for C>
<exp-op>
<number, integer value 2>
P. Kuppusamy -Lexical Analyzer
RE Terminilogy:
alphabet: a finite set of symbols.
E.g. ∑ ={a, b, c}
A stringis a finite sequence of symbols over an alphabet ∑
(sometimes a string is also called a sentence or a word).
A languageis a set of strings over an alphabet ∑.
Tokens are specified by regular expressions effectively.
P. Kuppusamy -Lexical Analyzer
Regular expressions
•Ɛis a regular expression, L(Ɛ) = {Ɛ}
•If a is a symbol in ∑ then a is a regular expression, L(a) = {a}
•Operations on languages (a set):
•Union
(r) | (s) is a regular expression denoting the language L(r) ∪L(s)
•Concatenation
(r) . (s) is a regular expression denoting the language L(r) . L(s)
•Kleene closure
(r)* is a regular expression denoting (L(r))*
•Positive closure
(r) is a regular expression denting (L(r))
+i
i
LL
0
*
i
i
LL
1
P. Kuppusamy -Lexical Analyzer
•Regular expressions are used to represent the patterns for the tokens.
•The regular definition for C identifiers (letters, digits,and underscores).
LetterA | B | C|…| Z | a | b | … |z| _
digit 0|1|2 |… | 9
id letter( letter | digit )*
•The regular definition for Unsigned numbers (integer orfloating point)
such as 5280, 0.01234, 6.336E4, or 1.89E-4.
digit 0|1|2 |… | 9
digits digit digit*
optionalFraction.digits |
optionalExponent( E( + |-| ) digits ) |
number digits optionalFractionoptionalExponent
•More examples: integer constant, string constants, reserved words, operator,
real constant.
Regular definition (Pattern)
P. Kuppusamy -Lexical Analyzer
RECOGNITION OF TOKENS
•Tokens recognized by state transition diagrams.
•Recognition of identifiers
•Recognition of Whitespace (delimiters)
•Recognition of Relational operator
•Recognition of Keywords
•Recognition of Numbers (integer / float)
P. Kuppusamy -Lexical Analyzer
RECOGNITION OF TOKENS
Given the grammar of branching statement:
The pattern for whitespace (ws), keywords & operators:
ws(blankspace| tab | newline)
+
if if
then then
else else
relop< | > | <= | >= | = | <>
P. Kuppusamy -Lexical Analyzer
Transition diagrams
•Transition diagram for relop
* indicates input retraction (pull back)
(i.e. not > , =)
(i.e. not =)
P. Kuppusamy -Lexical Analyzer
Transition diagrams (cont.)
•Transition diagram for reserved words and identifiers
•Transition diagram for whitespace
(i.e. not letter
or digit
(i.e. not delimiter)
P. Kuppusamy -Lexical Analyzer
Transition diagrams (cont.)
•Transition diagram for unsigned numbers
(i.e. not digit)
P. Kuppusamy -Lexical Analyzer
Major Data Structure in a
Compiler
P. Kuppusamy -Lexical Analyzer
Data Structure for Communication among Phases
•TOKENS
•Ascannercollectscharactersintoatoken,asavalueofan
enumerateddatatypefortokens
•Alsopreservethestringofcharactersorotherderived
information,suchasnameofidentifier,valueofanumbertoken
•Asingleglobalvariableoranarrayoftokens
•SYNTAX TREE
•Astandardpointer-basedstructuregeneratedbyparser
•Eachnoderepresentsinformationcollectbyparser,which
maybedynamicallyallocatedorstoredinsymboltable
•Thenoderequiresdifferentattributesdependingonkindof
languagestructure,whichmayberepresentedasvariablerecord.
P. Kuppusamy -Lexical Analyzer
•SYMBOLTABLE
•Keepsinformationassociatedwithidentifiers:function,variable,
constants,anddatatypes
•Interactswithalmosteveryphaseofcompiler.
•Accessoperationneedtobeconstant-time
•Oneorseveralhashtablesareoftenused,
•LITERALTABLE
•Storesconstantsandstrings,reducingsizeofprogram
•Quickinsertionandlookupareessential
Data Structure for Communication among Phases
P. Kuppusamy -Lexical Analyzer
•INTERMEDIATECODE
•Keptasanarrayoftextstring,atemporarytext,oralinkedlist
ofstructuresdependsonkindofintermediatecode(e.g.three-
addresscodeandp-code(Pascal-code))
•Shouldbeeasyforreorganization
•TEMPORARYFILES
•Holdstheproductofintermediatestepsduringcompiling
•Theproblemofmemoryconstraintsorback-patchisaddressed
duringcodegeneration
Data Structure for Communication among Phases
P. Kuppusamy -Lexical Analyzer
Bootstrapping in Compiler Design
•Bootstrappingisaprocesstotranslatesimplelanguageintomore
complicatedprogramwhichinturnmayhandleformore
complicatedprogram.
•Thiscomplicatedprogramcanfurtherhandleevenmore
complicatedprogramandsoon.
•Afullbootstrapisnecessarywhenbuildinganewcompilerfrom
scratch.
•Writingacompilerforanyhighlevellanguageisacomplicated
process.
•Ittakesmoretimetowriteacompilerfromscratch.
•Hencesimplelanguageisusedtogeneratetargetcodeinsome
stages(phases).
P. Kuppusamy -Lexical Analyzer
Tombstone (T) Diagram
Program P implemented in L
L
P
M
Machine implemented in hardware
M
L
Language interpreter in L
•TDiagramcontainssetof“puzzlepieces”that
representscompilersandotherrelatedlanguageprocessing
programs.
S T
L
Translator implemented in L
S –Lang
Compiler
accepts as i/p
T–Lang o/p
by Compiler
L–Lang Compiler
is written
P. Kuppusamy -Lexical Analyzer
Tombstone diagram: Combination rules
M
M
P
ok
M
L
P
Wrong!
L
P
ST
M
Wrong!
S
P P
T
ST
M
M
ok
Both should same
P. Kuppusamy -Lexical Analyzer
•WriteacrosscompilerfornewlanguageX.Lettheimplementation
languageofthiscompilerisYandthetargetcodebeinggeneratedisin
languageZ.Thatis,wecreateXYZ.
•NowifexistingcompilerYrunsonmachineMandgeneratescodefor
MthenitisdenotedasYMM.
Example:Bootstrapping in Compiler Design
X Z
Y
Compiler1
Y M
M
M
P. Kuppusamy -Lexical Analyzer
•NowifwerunXYZusingYMMthenwegetacompilerXMZ.That
meansacompilerforsourcelanguageXthatgeneratesatargetcodein
languageZandwhichrunsonmachineM.
Example:Bootstrapping in Compiler Design
X Z
Y
Compiler1
Y M
M
X Z
M
Thiscrosscompilercan
beusedforbootstrapping
onmachineMbutwedo
notwanttorelyonit
permanently.
Cross Compiler
Compiler2
P. Kuppusamy -Lexical Analyzer
Bootstrap to improve efficiency
The efficiency of programs and compilers:
Efficiency of programs:
-memory usage
-runtime
Efficiency of compilers:
-Efficiency of the compiler itself
-Efficiency of the emitted code
P. Kuppusamy -Lexical Analyzer
2. Translation rules section
•It Starts with %% and ends with %%
•Syntax:
Form of Pattern followed by action
%%
Pattern1{Action1}
Pattern2{Action2}
………
Pattern n{Action n}
%%
Pattern is Regular Expression & Action is C statements
E.g.
%%
{ws}{/* no action and no return */}
if {return(IF);}
then{return(THEN);}
else{return(ELSE);}
{id}{yylval= (int) installID(); return(ID); }
{number} {yylval= (int) installNum(); return(NUMBER); }
…
%%
IntinstallID()
{/*funtiontoinstallthelexeme,
whosefirstcharacterispointed
tobyyytext,andwhoselength
isyyleng,intothesymboltable
andreturnapointerthereto*/
}
IntinstallNum()
{/*similartoinstallID,butputs
numericalconstantsintoa
separatetable*/
}
P. Kuppusamy -Lexical Analyzer
3. Auxiliary function section
All required functions are defined in this section
This program used 2 predefined functions that are installID(),
installNum().
These functions need not to be defined here.
E.g.
Some user defined functions are
add()
{
// statements
}
mul()
{
//statements
}
IntinstallID()
{/*funtiontoinstallthelexeme,
whosefirstcharacterispointed
tobyyytext,andwhoselengthis
yyleng,intothesymboltableand
returnapointerthereto*/
}
IntinstallNum()
{/*similartoinstallID,butputs
numericalconstantsintoa
separatetable*/
}
P. Kuppusamy -Lexical Analyzer
Example -Lex program to recognize tokens:
Identify keywords, rel. operator, numbers
%{
/*no variable declaration */
}%
letter[A-Za-z]
digit[0-9]
id {letter}({letter}|{digit})*
number {digit}
+
(.{digit}+)?(E[+-]?{digit}+)?
%%
{id}{printf(“%s is an identifier”, yytext);} //yytestprovides token id , {} is optional
if {printf(“%s is keyword”, yytext);}
else{printf(“%s is keyword”, yytext);}
“<“{printf(“%s is Less than operator”, yytext);}
P. Kuppusamy -Lexical Analyzer
Lex program to recognize tokens (cont..d)
“>“{printf(“%s is Greater than operator”, yytext);}
“ <=“{printf(“%s is Less than or equal operator”, yytext); }
{number} {printf(“%s is number”, yytext);}
%%
P. Kuppusamy -Lexical Analyzer
Reference
•A.V. Aho, M.S. Lam, R. Sethi, J. D. Ullman, Compilers Principles,
Techniques and Tools, Pearson Edition, 2013.
•Lex&yacc–JohnR.Levine,TonyMason,DougBrown,O’reilly
P. Kuppusamy -Lexical Analyzer