Compiler Design – Phases of a Compiler (JNTUK Syllabus)
Heartily welcome to III CSE Students
Compiler Design, III-I Sem. By Padmaja Grandhe, Associate Professor, CSE Department, PSCMRCET
Objective: Understand the basic concepts of compiler design and its different phases, which will be helpful in constructing new tools like LEX, YACC, etc.
TEXT BOOKS: 1. Compilers: Principles, Techniques and Tools, Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman, 2nd edition, Pearson, 2007. 2. Compiler Design, K. Muneeswaran, Oxford. 3. Principles of Compiler Design, 2nd edition, Nandhini Prasad, Elsevier.
REFERENCE BOOKS: 1. Compiler Construction: Principles and Practice, Kenneth C. Louden, Cengage. 2. Implementations of Compiler: A New Approach to Compilers Including the Algebraic Method, Yunlin Su, Springer.
UNIT – I Introduction: Language processing, structure of a compiler, the evolution of programming languages, the science of building a compiler, applications of compiler technology. Lexical Analysis: the role of the lexical analyzer, input buffering, specification of tokens, recognition of tokens, the lexical-analyzer generator Lex tool.
UNIT – II Syntax Analysis: The role of the parser, context-free grammars, writing a grammar, top-down parsing, bottom-up parsing, introduction to LR parsers.
UNIT – III More powerful LR parsers (LR(1), LALR), using ambiguous grammars, error recovery in LR parsers. Syntax-Directed Translation: definition, evaluation order of SDTs, applications of SDTs, syntax-directed translation schemes.
UNIT – IV Intermediate Code Generation: variants of syntax trees, three-address code, types and declarations, translation of expressions, type checking, control flow, backpatching.
UNIT – V Code Generation: stack allocation of space, access to non-local data on the stack, heap management; issues in the design of a code generator, the target language, addresses in the target code, basic blocks and flow graphs, a simple code generator.
UNIT – VI Code Optimization: machine-independent optimization, the principal sources of optimization, peephole optimization, introduction to data-flow analysis.
1. Language Translator A language translator is a program that translates a program written in one language into another language. Ex: compiler, assembler, interpreter, linker, loader, and preprocessor.
Language processing System
1.1 Preprocessor A preprocessor produces input to the compiler. Its functions include: macro processing; file inclusion (including header files into the program text); and language extensions (adding capabilities to the language through built-in macros such as #define and #include).
1.2 Compiler A compiler is software that converts a program written in a high-level language (source language) into a low-level language (object/target/machine language) and reports errors in the source program. Ex: compilers for C, ML, LISP, Prolog.
1.3 Assembler An assembler is a program that translates programs written in assembly language into machine language.
1.4 Link-Editor (Linker) & Loader: Usually a long program is divided into smaller subprograms called modules, and these modules must be combined to execute the program. The process of combining the modules is done by the linker. High-level languages provide predefined libraries and built-in header files containing basic functions that are essential for executing a program; references to these functions are resolved against those libraries by a program called the linker.
1.4 Link-Editor (Linker) & Loader contd.: The loader is responsible for loading programs and libraries. It is one of the essential stages in executing a program, as it places programs into memory and prepares them for execution. The final output is absolute machine code, which is loaded into fixed storage locations and may not be relocated during execution of the program.
Short answer questions What is the purpose of the loader/linker in language processing? List any 4 compilers and 2 interpreters you know. What is a preprocessor? Mention its objectives. List the differences between an interpreter and a compiler.
1.6 Compiler and interpreter differences
Compiler: converts the entire source code of a program into executable machine code for a CPU. Interpreter: takes a source program and runs it line by line, translating each line as it comes to it.
Compiler: takes more time to analyze the entire source code, but the overall execution time of the program is comparatively faster. Interpreter: takes less time to analyze the source code, but the overall execution time of the program is slower.
Compiler: generates error messages only after scanning the whole program, so debugging is comparatively hard since an error can be anywhere in the program. Interpreter: debugging is easier, since translation proceeds line by line until the error is met.
Compiler: generates intermediate object code. Interpreter: no intermediate object code is generated.
Compiler examples: C, C++, ML, LISP. Interpreter examples: Python, Perl.
Questions given in university End Exams Describe the functionality of compilers in a typical language processing system What are program translators? Explain.
2. STRUCTURE OF THE COMPILER / PHASES OF A COMPILER A compiler operates in phases. A phase is a logically interrelated operation that takes the source program in one representation and produces output in another representation. The two parts of compilation are: the Analysis phase or front end (machine independent / language dependent) and the Synthesis phase or back end (machine dependent / language independent).
(Diagram: the analysis phase feeding the synthesis phase.)
2.1 Lexical Analysis or Scanning Lexical analysis is performed by the lexical analyzer or scanner. It is the first phase of the compiler. It gets input from the source program and produces tokens as output. It reads the characters one by one, from left to right, and forms tokens. Token: a logically cohesive sequence of characters, e.g., keywords, operators, identifiers, special symbols, etc. Example: in a + b = 20, the tokens are a, b, +, =, and 20.
2.1 Lexical Analysis contd.. A group of characters forming a token is called a lexeme. The lexical analyzer not only generates a token but also enters the lexeme and its information into the symbol table if it is not already there. Example: Num1 = 900.
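As an illustration only (the token names, the attribute convention, and the use of an index into the symbol table are assumptions, not the textbook's exact layout), a minimal C sketch of the token stream produced for Num1 = 900:

/* Sketch: the scanner turns  Num1 = 900  into <token, attribute> pairs. */
#include <stdio.h>

enum tname { ID, ASSIGN, NUM };
struct token { enum tname name; int attr; };   /* attr: symbol-table index for ID, value for NUM */

int main(void) {
    struct token stream[] = { {ID, 1}, {ASSIGN, 0}, {NUM, 900} };
    const char *names[] = { "id", "assign", "num" };
    for (int i = 0; i < 3; i++)
        printf("<%s,%d> ", names[stream[i].name], stream[i].attr);
    printf("\n");                              /* prints: <id,1> <assign,0> <num,900> */
    return 0;
}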
2.2 Syntax Analysis Syntax analysis is performed by the syntax analyzer, also known as the parser. It is the second phase of the compiler. It gets the token stream as input from the lexical analyzer and analyses the syntactic structure of the input using the specified grammar rules. It reports syntax errors if any; otherwise it generates a parse tree or syntax tree as output. Syntax tree: a tree in which interior nodes are operators and leaf nodes are operands.
2.2 Syntax Analysis contd.. Example: for a = b + c * 2, i.e., <id,1> = <id,2> + <id,3> * 2, the syntax tree has '=' at the root with <id,1> as its left child and '+' as its right child; '+' has children <id,2> and '*', and '*' has children <id,3> and 2.
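A minimal C sketch of one possible syntax-tree representation for this example (the node layout and the mk/leaf helper names are illustrative assumptions, not a prescribed representation):

/* Sketch: a node is either an operator with two children or a leaf. */
#include <stdio.h>
#include <stdlib.h>

struct node {
    char op;               /* '=', '+', '*', or 0 for a leaf */
    const char *leaf;      /* lexeme for a leaf node, else NULL */
    struct node *left, *right;
};

static struct node *mk(char op, struct node *l, struct node *r) {
    struct node *n = malloc(sizeof *n);
    n->op = op; n->leaf = NULL; n->left = l; n->right = r;
    return n;
}
static struct node *leaf(const char *s) {
    struct node *n = malloc(sizeof *n);
    n->op = 0; n->leaf = s; n->left = n->right = NULL;
    return n;
}

int main(void) {
    /* a = b + c * 2   becomes   =( a, +( b, *( c, 2 ) ) ) */
    struct node *t = mk('=', leaf("a"),
                        mk('+', leaf("b"), mk('*', leaf("c"), leaf("2"))));
    printf("root operator: %c\n", t->op);
    return 0;
}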
2.3 Semantic Analysis Semantic analysis is the third phase of the compiler, performed by the semantic analyzer. It makes sure that the declarations and statements of the program are semantically correct, and semantic errors are reported (e.g., c = 0; a = b / c). The semantic analyzer uses the syntax tree and the symbol table to check whether the given program is semantically consistent with the language definition. It gathers type information and stores it in either the syntax tree or the symbol table.
2.3 Semantic Analysis contd.. Semantic errors (examples): type mismatch, undeclared variables, reserved identifier misuse. Functions of semantic analysis: Type checking – ensures that data types are used in a way consistent with their definition. Label checking – labels referenced in a program should exist. Flow control check – checks that control structures are used in a proper manner (example: no break statement outside a loop). Implicit type conversion (coercion): in a = b + c * 60, if a, b, c are float, 60 is automatically converted to 60.0.
(Syntax tree after coercion: the constant operand of '*' is wrapped in an inttofloat node.)
2.4. Intermediate Code Generation Performed by the intermediate code generator. In the analysis-synthesis model of a compiler, the front end translates a source program into a machine-independent intermediate code; the back end then uses this intermediate code to generate the target code. Commonly used intermediate forms are three-address code and postfix notation.
2.4. Intermediate Code Generation contd.. A statement involving no more than three references (two for operands and one for the result) is known as a three-address statement. A sequence of three-address statements is known as three-address code. Ex: three-address code for a = b + c * 60:
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
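A minimal C sketch showing how this three-address code could be stored as quadruples (operator, arg1, arg2, result); the struct layout and names are assumptions, not a prescribed format:

/* Sketch: three-address instructions stored as quadruples. */
#include <stdio.h>

struct quad { const char *op, *arg1, *arg2, *result; };

int main(void) {
    struct quad code[] = {
        { "inttofloat", "60",  "",   "t1"  },
        { "*",          "id3", "t1", "t2"  },
        { "+",          "id2", "t2", "t3"  },
        { "=",          "t3",  "",   "id1" },
    };
    for (int i = 0; i < 4; i++)          /* print each quadruple as a tuple */
        printf("(%s, %s, %s, %s)\n",
               code[i].op, code[i].arg1, code[i].arg2, code[i].result);
    return 0;
}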
2.5. Code Optimization It is the fifth phase of the compiler. It gets the intermediate code as input and produces optimized intermediate code as output. This phase reduces redundant code and attempts to improve the intermediate code so that faster-running machine code will result. During code optimization, the result of the program is not affected. To improve the code, optimization involves: detection and removal of dead code (unreachable code); calculation of constants in expressions and terms; collapsing of repeated expressions into temporary variables; loop unrolling; moving code outside the loop; removal of unwanted temporary variables.
2.6. Code Generation It is the final phase of the compiler. It gets input from the code optimization phase and produces the target code or object code as the result. Intermediate instructions are translated into a sequence of machine instructions that perform the same task. Code generation involves: allocation of registers and memory; generation of correct references; generation of correct data types; generation of missing code.
2.7. Symbol Table Management (or) Book-keeping The symbol table is used to store all the information about identifiers used in the program. It is a data structure containing a record for each identifier, with fields for the attributes of the identifier. It allows the compiler to find the record for each identifier quickly and to store or retrieve data from that record. Whenever an identifier is detected in any of the phases, it is stored in the symbol table.
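A minimal C sketch of a symbol table as a flat array of records with a lookup-or-insert routine (a real compiler would typically use a hash table; the field set and all names here are illustrative assumptions):

/* Sketch: one record per identifier, with fields for its attributes. */
#include <stdio.h>
#include <string.h>

struct sym { char name[32]; char type[16]; int first_line; };

#define MAXSYM 256
static struct sym table[MAXSYM];
static int nsym = 0;

/* lookup-or-insert: any phase can find an identifier's record quickly
   (assumes fewer than MAXSYM distinct identifiers) */
static int enter(const char *name, const char *type, int line) {
    for (int i = 0; i < nsym; i++)
        if (strcmp(table[i].name, name) == 0) return i;
    strcpy(table[nsym].name, name);
    strcpy(table[nsym].type, type);
    table[nsym].first_line = line;
    return nsym++;
}

int main(void) {
    int i = enter("rate", "float", 3);
    printf("%s : %s (first seen at line %d)\n",
           table[i].name, table[i].type, table[i].first_line);
    return 0;
}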
2.8. Error Handlers Each phase can encounter errors. After detecting an error, a phase must handle the error so that compilation can proceed. In lexical analysis, errors occur in the separation of tokens. In syntax analysis, errors occur during construction of the syntax tree. In semantic analysis, errors occur when the compiler detects constructs with correct syntactic structure but no meaning, and during type conversion. In code optimization, errors occur when the result is affected by the optimization. In code generation, errors occur when code is missing, etc.
Example: position = initial + rate * 60. Output of the lexical analyzer: <id,1> = <id,2> + <id,3> * 60. Output of the syntax analyzer: the syntax tree with '=' at the root, <id,1> on the left and '+' on the right; '+' has children <id,2> and '*', and '*' has children <id,3> and 60.
Output of the semantic analyzer: the same syntax tree with 60 wrapped in an inttofloat node. Output of intermediate code generation:
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
Output of the code optimizer: the four intermediate instructions above are reduced to two, because the conversion of 60 can be done at compile time and the copy id1 = t3 can be eliminated:
t1 = id3 * 60.0
id1 = id2 + t1
Output of the code generator (LDF dst, src loads a floating-point source value into dst; STF dst, src stores src into dst; ADDF/MULF result, op1, op2 compute result = op1 op op2; for operations other than load and store, operands must be in registers or given directly as values):
LDF r1, id2
LDF r2, id3
MULF r2, r2, 60.0
ADDF r1, r1, r2
STF id1, r1
Assignment Explain the process of converting infix notation to postfix notation. Differentiate between token, lexeme, and pattern with examples. What happens in the Analysis and Synthesis phases of compilation? What is the key difference between lexical analysis and parsing?
Pattern: a rule associated with a token that describes the set of strings the token can match. Ex: for the input pi, the lexeme is pi, the token is <id,1>, and the pattern is (letter)(letter|digit)*.
3. Applications of Compiler Technology Implementation of high-level programming languages Optimizations for computer architectures Design of new computer architectures Program translations Software productivity tools
3.1. Implementation of High-Level Programming Languages A high-level programming language defines a programming abstraction: the programmer expresses an algorithm using the language, and the compiler must translate that program to the target language. Higher-level programming languages are easier to program in, but are less efficient; that is, the target programs run more slowly. Low-level languages give more control over a computation and can, in principle, produce more efficient code, but lower-level programs are harder to write and, worse still, less portable, more prone to errors, and harder to maintain. Optimizing compilers include techniques to improve the performance of generated code, thus offsetting the inefficiency introduced by high-level abstractions.
C (1970), C++ (1985), Java (1995); data-flow optimizations were introduced to support such languages. Object orientation was first introduced in Simula in 1967, and has been incorporated in languages such as Smalltalk, C++, C#, and Java. The key ideas behind object orientation are data abstraction and inheritance of properties, which also drive compiler optimization. These languages are easy to program in because of their reusability and modularity concepts.
3.2. Optimizations for Computer Architectures The rapid evolution of computer architectures has also led to demand for new compiler technology. Almost all high-performance systems take advantage of the same two basic techniques: parallelism and memory hierarchies. Parallelism can be found at several levels: at the instruction level, where multiple operations are executed simultaneously, and at the processor level, where different threads of the same application are run on different processors. Memory hierarchies are a response to the basic limitation that we can build very fast storage or very large storage, but not storage that is both fast and large. Levels: secondary memory, primary memory, cache memory.
3.3 Design of New Computer Architectures In the early days of computer architecture design, compilers were developed after the machines were built. That has changed. Since programming in high-level languages is the norm, the performance of a computer system is determined not only by its raw speed but also by how well compilers can exploit its features. Thus, in modern computer architecture development, compilers are developed in the processor-design stage, and compiled code, running on simulators, is used to evaluate the proposed architectural features. RISC: Reduced Instruction Set Computer; CISC: Complex Instruction Set Computer (x86). Most general-purpose processor architectures, including PowerPC, SPARC, MIPS, Alpha, and PA-RISC, are based on the RISC concept.
3.4. Program Translations While we normally think of compiling as a translation from a high-level language to the machine level, the same technology can be applied to translate between different kinds of languages. The following are some of the important applications of program-translation techniques. Binary Translation Hardware Synthesis Database Query Interpreters Compiled Simulation
Binary Translation Compiler technology can be used to translate the binary code for one machine to that of another, allowing a machine to run programs originally compiled for another instruction set. Binary translation technology has been used by various computer companies to increase the availability of software for their machines. The Transmeta Crusoe processor is a VLIW processor that relies on binary translation to convert x86 code into native VLIW code. Binary translation can also be used to provide backward compatibility: when the processor in the Apple Macintosh was changed from the Motorola MC 68040 to the PowerPC in 1994, binary translation was used to allow PowerPC processors to run legacy MC 68040 code.
Hardware Synthesis Hardware designs are mostly described in high-level hardware description languages like Verilog and VHDL (Very high-speed integrated circuit Hardware Description Language). Hardware designs are typically described at the register transfer level (RTL), where variables represent registers and expressions represent combinational logic . Hardware-synthesis tools translate RTL descriptions automatically into gates, which are then mapped to transistors and eventually to a physical layout.
Database Query Interpreters Query languages, especially SQL (Structured Query Language), are used to search databases. Database queries consist of predicates containing relational and boolean operators. They can be interpreted or compiled into commands to search a database for records satisfying the predicate.
Compiled Simulation Simulation is a general technique used in many scientific and engineering disciplines to understand a phenomenon or to validate a design. Inputs to a simulator usually include the description of the design and specific input parameters for that particular simulation run. Simulations can be very expensive; each experiment may take days to complete on a high-performance machine. Instead of writing a simulator that interprets the design, it is faster to compile the design to produce machine code that simulates that particular design natively. Compiled simulation can run orders of magnitude faster than an interpreter-based approach, and is used in many state-of-the-art tools that simulate designs written in Verilog or VHDL.
3.5. Software Productivity Tools Errors are rampant in programs; they may crash a system, produce wrong results, render a system vulnerable to security attacks, or even lead to catastrophic failures in critical systems. Testing is the primary technique for locating errors in programs. An interesting and promising complementary approach is to use data-flow analysis to locate errors statically. The problem of finding all program errors is undecidable, but a data-flow analysis can be designed to warn programmers of all possible statements violating a particular category of errors.
Examples of such analyses: type checking (e.g., a = b + c), bounds checking (array index out of bounds, buffer overflow, e.g., a[50]), and memory-management tools (detecting garbage and memory leaks).
4. The Evolution of Programming Languages The first electronic computers appeared in the 1940's and were programmed in machine language by sequences of 0's and 1's that explicitly told the computer what operations to execute and in what order. The operations themselves were very low level: move data from one location to another, add the contents of two registers, compare two values, and so on. This kind of programming was slow, tedious, and error prone. And once written, the programs were hard to understand and modify.
4.1. The Move to Higher-level Languages Mnemonic assembly languages appeared in the early 1950's. Initially, the instructions in an assembly language were just mnemonic representations of machine instructions. Later, macro instructions were added to assembly languages so that a programmer could define parameterized shorthands for frequently used sequences of machine instructions.
A major step towards higher-level languages was made in the latter half of the 1950's with the development of Fortran for scientific computation, Cobol for business data processing, and Lisp for symbolic computation. Programmers could more easily write numerical computations, business applications, and symbolic programs. These languages were so successful that they are still in use today.
Today, there are thousands of programming languages. They can be classified in a variety of ways. One classification is by generation. First-generation languages are the machine languages, second-generation the assembly languages, and third-generation the higher-level languages like Fortran, Cobol, Lisp, C, C++, C#, and Java. Fourth-generation languages are languages designed for specific applications, like NOMAD for report generation, SQL for database queries, and PostScript for text formatting. The term fifth-generation language has been applied to logic- and constraint-based languages like Prolog and OPS5.
Another classification of languages uses the term imperative for languages in which a program specifies how a computation is to be done, and declarative for languages in which a program specifies what computation is to be done. Languages such as C, C++, C#, and Java are imperative languages. In imperative languages there is a notion of program state and statements that change the state. Functional languages such as ML and Haskell and constraint logic languages such as Prolog are often considered to be declarative languages.
An object-oriented language is one that supports object-oriented programming, a programming style in which a program consists of a collection of objects that interact with one another. Simula 67 and Smalltalk are the earliest major object-oriented languages. Languages such as C++, C#, Java, and Ruby are more recent object-oriented languages. Scripting languages are interpreted languages with high-level operators designed for "gluing together" computations. These computations were originally called "scripts." Awk, JavaScript, Perl, PHP, Python, Ruby, and Tcl are popular examples of scripting languages. Programs written in scripting languages are often much shorter than equivalent programs written in languages like C.
5. The Science of Building a Compiler A compiler must accept all source programs that conform to the specification of the language; the set of source programs is infinite, and any program can be very large, consisting of possibly millions of lines of code. Any transformation performed by the compiler while translating a source program must preserve the meaning of the program being compiled. These requirements make compiler development challenging.
5.1. Modelling in Compiler Design and Implementation The study of compilers is mainly a study of how we design the right mathematical models and choose the right algorithms, while balancing the need for generality and power against simplicity and efficiency. The fundamental models are finite-state machines and regular expressions. These models are useful for describing the lexical units of programs (keywords, identifiers, and such) and for describing the algorithms used by the compiler to recognize those units. Other models, such as context-free grammars, are used to describe the syntactic structure of programming languages, e.g., the nesting of parentheses or control constructs. Similarly, trees are an important model for representing the structure of programs and their translation into object code.
5.2. The Science of Code Optimization The term "optimization" in compiler design refers to the attempts that a compiler makes to produce code that is more efficient than the obvious code. Compiler optimizations must meet the following design objectives: the optimization must be correct, that is, preserve the meaning of the compiled program; the optimization must improve the performance of many programs; the compilation time must be kept reasonable; and the engineering effort required must be manageable. The second goal means that the compiler must be effective in improving the performance of many input programs. Normally, performance means the speed of program execution; other factors matter as well, such as code size (embedded systems) and power consumption (mobile devices).
Third, we need to keep the compilation time short to support a rapid development and debugging cycle. Finally, a compiler is a complex system; we must keep the system simple to assure that the engineering and maintenance costs of the compiler are manageable. There is an infinite number of program optimizations that we could implement, and it takes a nontrivial amount of effort to create a correct and effective optimization. We must prioritize the optimizations, implementing only those that lead to the greatest benefits on source programs encountered in practice.
Lexical Analysis: the role of the lexical analyzer, input buffering, specification of tokens, recognition of tokens, the lexical-analyzer generator Lex tool.
6. The Role of the Lexical Analyzer The main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes, and produce as output a sequence of tokens, one for each lexeme in the source program. The stream of tokens is sent to the parser for syntax analysis. The lexical analyzer also interacts with the symbol table: when it discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol table. In some cases, information regarding the kind of identifier may be read from the symbol table by the lexical analyzer to assist it in determining the proper token it must pass to the parser.
The parser calls the lexical analyzer with a getNextToken command, which causes the lexical analyzer to read characters from its input until it can identify the next lexeme, produce a token for it, and return the token to the parser. Besides scanning and generating tokens, the lexical analyzer performs certain other tasks: stripping out comments and whitespace (blank, newline, tab, and perhaps other characters used to separate tokens in the input); and correlating error messages generated by the compiler with the source program. For instance, the lexical analyzer may keep track of the number of newline characters seen, so it can associate a line number with each error message. In some compilers, the lexical analyzer makes a copy of the source program with the error messages inserted at the appropriate positions.
Example source fragment as seen by the scanner (note the misspelled directive): # inclue <...   int abc   A = b + c
The Role of the Lexical Analyzer contd.. Lexical Analysis Versus Parsing; Tokens, Patterns, and Lexemes; Attributes for Tokens; Lexical Errors
1. Lexical Analysis Versus Parsing There are a number of reasons why lexical analysis and parsing are separated: 1. Simplicity of design is the most important consideration. The separation of lexical and syntactic analysis often allows us to simplify at least one of these tasks. If we are designing a new language, separating lexical and syntactic concerns can lead to a cleaner overall language design. 2. Compiler efficiency is improved. A separate lexical analyzer allows us to apply specialized techniques that serve only the lexical task, not the job of parsing. In addition, specialized buffering techniques for reading input characters can speed up the compiler significantly. 3. Compiler portability is enhanced. Input-device-specific peculiarities can be restricted to the lexical analyzer.
6.2. Tokens, Patterns, and Lexemes Token: a logically cohesive sequence of characters, e.g., keywords, operators, identifiers, special symbols, etc. Lexeme: the group of characters forming a token. Pattern: a rule associated with a token that describes the set of strings it can match. Ex: num1 = b + c is tokenized as <id,1> = <id,2> + <id,3>, whereas 1awe = a + b is not, because 1awe does not match the identifier pattern.
6.3. Attributes for Tokens When more than one lexeme can match a pattern, the lexical analyzer must provide the subsequent compiler phases additional information about the particular lexeme that matched. For example, the pattern for token number matches both 0 and 1 , but it is extremely important for the code generator to know which lexeme was found in the source program. Thus, in many cases the lexical analyzer returns to the parser not only a token name, but an attribute value that describes the lexeme represented by the token; the token name influences parsing decisions, while the attribute value influences translation of tokens after the parse. Normally, information about an identifier — e.g., its lexeme, its type , and the location at which it is first found (in case an error message about that identifier must be issued) — is kept in the symbol table. Thus, the appropriate attribute value for an identifier is a pointer to the symbol-table entry for that identifier.
6.4. Lexical Errors Consider fi ( a == f(x) ) ... A lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier. Since fi is a valid lexeme for the token id, the lexical analyzer must return the token id to the parser and let some other phase of the compiler, probably the parser in this case, handle an error due to transposition of the letters. However, suppose a situation arises in which the lexical analyzer is unable to proceed because none of the patterns for tokens matches any prefix of the remaining input. The simplest recovery strategy is "panic mode" recovery: we delete successive characters from the remaining input until the lexical analyzer can find a well-formed token at the beginning of what input is left. This recovery technique may confuse the parser, but in an interactive computing environment it may be quite adequate.
Other possible error-recovery actions are: Delete one character from the remaining input. Insert a missing character into the remaining input. Replace a character by another character. Transpose two adjacent characters. Transformations like these may be tried in an attempt to repair the input. The simplest such strategy is to see whether a prefix of the remaining input can be transformed into a valid lexeme by a single transformation. This strategy makes sense, since in practice most lexical errors involve a single character. A more general correction strategy is to find the smallest number of transformations needed to convert the source program into one that consists only of valid lexemes , but this approach is considered too expensive in practice to be worth the effort.
7. Input Buffering 7.1. Buffer Pairs An important scheme involves two buffers (or one buffer divided into two halves) that are alternately reloaded, as suggested in Fig. 3.3. Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096 bytes. Using one system read command we can read N characters into a buffer, rather than using one system call per character. If fewer than N characters remain in the input file, then a special character, represented by eof, marks the end of the source file.
Two pointers to the input are maintained: Pointer lexemeBegin , marks the beginning of the current lexeme, whose extent we are attempting to determine. Pointer forward scans ahead until a pattern match is found. Once the next lexeme is determined, forward is set to the character at its right end. Then, after the lexeme is recorded as an attribute value of a token returned to the parser, lexemeBegin is set to the character immediately after the lexeme just found. In Fig. 3.3, we see forward has passed the end of the next lexeme, ** (the Fortran exponentiation operator), and must be retracted one position to its left. Advancing forward requires that we first test whether we have reached the end of one of the buffers, and if so, we must reload the other buffer from the input, and move forward to the beginning of the newly loaded buffer.
(Figure: a buffer pair of two N-byte halves, labelled first half and second half. See https://www.slideshare.net/dattatraygandhmal/input-buffering)
(Example: while scanning Abc = pqr * xyz, lexemeBegin marks the 'A' and forward scans ahead; when the lexeme Abc is complete it is returned as an id token, and '=' is then recognized as an operator.)
2. Sentinels We can combine the buffer-end test with the test for the current character if we extend each buffer to hold a sentinel character at the end. The sentinel is a special character that cannot be part of the source program, and a natural choice is the character eof . Figure 3.4 shows the same arrangement as Fig. 3.3, but with the sentinels added. Note that eof retains its use as a marker for the end of the entire input. Any eof that appears other than at the end of a buffer means that the input is at an end.
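A minimal C sketch of the buffer-pair scheme with sentinels (assumptions: input is read from stdin, '\0' stands in for the eof sentinel, and reload/next_char are hypothetical helper names, not the textbook's code):

/* Sketch: two N-byte halves, each followed by a sentinel slot. */
#include <stdio.h>

#define N 4096                       /* size of each buffer half            */
#define EOF_CH '\0'                  /* sentinel character (assumed unused) */

static char buf[2 * N + 2];          /* first half, sentinel, second half, sentinel */
static char *forward = buf;          /* the scanner's forward pointer       */

static void reload(char *half) {     /* refill one half from the input      */
    size_t n = fread(half, 1, N, stdin);
    half[n] = EOF_CH;                /* sentinel also marks real end of input */
}

static int next_char(void) {
    for (;;) {
        char c = *forward++;
        if (c != EOF_CH) return (unsigned char)c;   /* common case: one test */
        if (forward == buf + N + 1)                 /* hit end of first half */
            reload(buf + N + 1);                    /* load the second half  */
        else if (forward == buf + 2 * N + 2) {      /* hit end of second half*/
            reload(buf);                            /* reload the first half */
            forward = buf;
        } else
            return EOF;                             /* sentinel mid-buffer: real end */
    }
}

int main(void) {
    reload(buf);                                    /* prime the first half  */
    int c, count = 0;
    while ((c = next_char()) != EOF) count++;
    printf("%d characters scanned\n", count);
    return 0;
}

On reaching a sentinel, one extra test decides whether to reload the other half or report end of input, so the common case costs a single comparison per character.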
8.Specification of Tokens 1. Strings and Languages 2. Operations on Languages 3. Regular Expressions 4. Regular Definitions
1. Strings and Languages An alphabet is any finite set of symbols. Typical examples of symbols are letters, digits, and punctuation. The set {0,1} is the binary alphabet. The following string-related terms are commonly used: A prefix of string s is any string obtained by removing zero or more symbols from the end of s. For example, ban, banana, and ε are prefixes of banana. A suffix of string s is any string obtained by removing zero or more symbols from the beginning of s. For example, nana, banana, and ε are suffixes of banana. A substring of s is obtained by deleting any prefix and any suffix from s. For instance, banana, nan, and ε are substrings of banana.
The proper prefixes, suffixes, and substrings of a string s are those prefixes, suffixes, and substrings, respectively, of s that are not ε and not equal to s itself. A subsequence of s is any string formed by deleting zero or more not necessarily consecutive positions of s. For example, baan is a subsequence of banana. If x and y are strings, then the concatenation of x and y, denoted xy, is the string formed by appending y to x. For example, if x = dog and y = house, then xy = doghouse. The empty string is the identity under concatenation; that is, for any string s, εs = sε = s.
2. Operations on Languages The operations on languages are union, concatenation, and closure. Union is the familiar operation on sets. The concatenation of languages is all strings formed by taking a string from the first language and a string from the second language, in all possible ways, and concatenating them. The (Kleene) closure of a language L, denoted L*, is the set of strings you get by concatenating L zero or more times. Finally, the positive closure, denoted L+, is the same as the Kleene closure but without the term L⁰; that is, ε will not be in L+ unless it is in L itself.
Example 3.3: Let L be the set of letters {A, B, ..., Z, a, b, ..., z} and let D be the set of digits {0, 1, ..., 9}. We may think of L and D in two, essentially equivalent, ways. L ∪ D is the language with 62 strings of length one, each of which is either one letter or one digit. LD is the set of 520 strings of length two, each consisting of one letter followed by one digit. L⁴ is the set of all 4-letter strings. L* is the set of all strings of letters, including ε, the empty string. L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter. D⁺ is the set of all strings of one or more digits.
Regular Expressions The languages accepted by finite automata can be easily described by simple expressions called regular expressions, and the languages described by regular expressions are referred to as regular languages. A regular expression can also be described as a pattern that defines a set of strings. Regular expressions are used to match character combinations in strings; string-searching algorithms use such patterns to find matches in a string.
INDUCTION: There are four parts to the induction whereby larger regular expressions are built from smaller ones. Suppose r and s are regular expressions denoting languages L(r) and L(s), respectively. 1. (r)|(s) is a regular expression denoting the language L(r) ∪ L(s). 2. (r)(s) is a regular expression denoting the language L(r)L(s). 3. (r)* is a regular expression denoting (L(r))*. 4. (r) is a regular expression denoting L(r).
Regular Definitions For notational convenience, we may wish to give names to certain regular expressions and use those names in subsequent expressions, as if the names were themselves symbols. If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form d1 → r1, d2 → r2, ..., dn → rn, where each di is a new symbol, not in Σ and not the same as any other of the d's, and each ri is a regular expression over Σ ∪ {d1, d2, ..., di-1}.
C identifiers are strings of letters, digits, and underscores. Here is a regular definition for the language of C identifiers (we conventionally use italics for the symbols defined in regular definitions):
letter → A | B | ... | Z | a | b | ... | z
digit → 0 | 1 | 2 | ... | 9
id → letter ( letter | digit | _ )*
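A small Lex specification corresponding to this regular definition (the printing action is illustrative only):

%{
#include <stdio.h>
%}
letter   [A-Za-z]
digit    [0-9]
%%
{letter}({letter}|{digit}|_)*   { printf("identifier: %s\n", yytext); }
.|\n                            { /* ignore everything else */ }
%%
int yywrap(void) { return 1; }
int main() { yylex(); return 0; }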
Write the regular definition for representing a number: number → [0-9]+. Write the regular definition for representing a floating-point number (e.g., 95.89, .987): digit → [0-9]; number → digit+; fnumber → number . number (e.g., 0.987).
Unsigned numbers (integer or floating point) are strings such as 5280, 0.01234, 6.336E4, or 1.89E-4. The regular definition:
digit → 0 | 1 | 2 | 3 | ... | 9
number → digit digit*
opfrac → . number | ε
opexp → ( E ( + | - | ε ) number ) | ε
unum → number opfrac opexp
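The same definition written as a small Lex specification (a sketch reusing the skeleton of the identifier example above; the printing action is illustrative):

%{
#include <stdio.h>
%}
digit    [0-9]
number   {digit}+
%%
{number}(\.{number})?(E[+-]?{number})?   { printf("unsigned number: %s\n", yytext); }
.|\n                                     { }
%%
int yywrap(void) { return 1; }
int main() { yylex(); return 0; }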
Recognition of Tokens Transition diagrams have a collection of nodes or circles, called states. Each state represents a condition that could occur during the process of scanning the input while looking for a lexeme that matches one of several patterns. There is an initial (start) state, and states are either final or non-final; a final state is an accepting state. Edges are directed from one state of the transition diagram to another, and each edge is labelled by a symbol or set of symbols.
(Transition diagram for relational operators: from the initial state, edges labelled <, = and > lead to final states accepting <, <=, <>, =, > and >=. Example input: a < b.)
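A minimal C sketch of this transition diagram as code: state changes become nested tests, and retract() corresponds to the retraction marks on the diagram. get_char, retract, and the token codes are hypothetical names, not a prescribed interface:

/* Sketch: recognizing relational operators by following the diagram. */
#include <stdio.h>

enum relop { LT, LE, EQ, NE, GT, GE, NOT_RELOP };

static const char *src;                     /* input being scanned */
static int pos;

static int  get_char(void) { return src[pos++]; }
static void retract(void)  { pos--; }       /* un-read one character */

static enum relop relop_token(void) {
    int c = get_char();
    if (c == '<') {
        c = get_char();
        if (c == '=') return LE;            /* <=                    */
        if (c == '>') return NE;            /* <>                    */
        retract(); return LT;               /* just <, retract once  */
    }
    if (c == '=') return EQ;
    if (c == '>') {
        c = get_char();
        if (c == '=') return GE;            /* >=                    */
        retract(); return GT;               /* just >, retract once  */
    }
    retract(); return NOT_RELOP;            /* not a relational op   */
}

int main(void) {
    src = "<="; pos = 0;
    printf("token code: %d\n", relop_token());   /* prints the code for LE */
    return 0;
}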
The Lexical-Analyzer Generator Lex Tool: Use of Lex; Structure of Lex Programs; Conflict Resolution in Lex; The Lookahead Operator
1. Use of Lex A Lex source program lex.l is run through the Lex compiler to produce lex.yy.c; lex.yy.c is compiled by the C compiler into a program a.out, which is the lexical analyzer that transforms an input stream into a sequence of tokens.
Structure of Lex Programs A Lex program is separated into three sections by %% delimiters. The format of a Lex source file is as follows:
{ definitions }
%%
{ rules }
%%
{ user subroutines }
Definitions include declarations of constants, variables, and regular definitions. Rules are statements of the form p1 {action1}, p2 {action2}, ..., pn {actionn}, where each pi is a regular expression and each actioni describes what the lexical analyzer should do when pattern pi matches a lexeme. User subroutines are auxiliary procedures needed by the actions; the subroutines can be compiled separately and loaded with the lexical analyzer.
Conflict Resolution in Lex Lex uses two rules to decide on the proper lexeme to select when several prefixes of the input match one or more patterns. Rule 1: always prefer a longer prefix to a shorter prefix. Ex: in a >= b + c, the prefix >= is recognized as the single token GE rather than > followed by =. Rule 2: if the longest possible prefix matches two or more patterns, prefer the pattern listed first in the Lex program. Ex: the lexeme if matches both the keyword pattern and the identifier pattern; listing the keyword rule first makes if a keyword.
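A small illustrative Lex fragment showing both rules at work (patterns and messages are assumptions, not from the textbook):

%{
#include <stdio.h>
%}
%%
"if"                  { printf("keyword IF\n");     /* listed before the id rule          */ }
[A-Za-z][A-Za-z0-9]*  { printf("id %s\n", yytext);  /* "if" also matches, rule above wins */ }
">="                  { printf("relop GE\n");       /* longer prefix beats ">" alone      */ }
">"                   { printf("relop GT\n"); }
.|\n                  { }
%%
int yywrap(void) { return 1; }
int main() { yylex(); return 0; }

With longest match, >= beats >; with first-rule-wins, if is reported as a keyword even though it also matches the identifier pattern.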
4. The Lookahead Operator Lex automatically reads one character ahead of the last character that forms the selected lexeme, and then retracts the input so only the lexeme itself is consumed. However, sometimes we want a certain pattern to be matched only when it is followed by certain other characters. In that case we may use the slash (/) in a pattern to indicate the end of the part of the pattern that matches the lexeme; the rest is lookahead and is not consumed.
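A small illustrative Lex sketch of the lookahead (trailing-context) operator: the first rule matches a name only when it is followed by '(', but the '(' itself is not part of the lexeme and is not consumed (rules and messages are assumptions):

%{
#include <stdio.h>
%}
%%
[A-Za-z][A-Za-z0-9]*/"("   { printf("name followed by ( : %s\n", yytext); /* "(" is lookahead only */ }
[A-Za-z][A-Za-z0-9]*       { printf("other identifier: %s\n", yytext); }
.|\n                       { }
%%
int yywrap(void) { return 1; }
int main() { yylex(); return 0; }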
Example: words such as Rthjj and 6878jjnkjk match the pattern [A-Za-z0-9]+, and each match executes the action i++ in the program below.
/* Lex program to count the number of words */
%{
#include <stdio.h>
#include <string.h>
int i = 0;
%}
/* Rules section */
%%
[a-zA-Z0-9]+   { i++; /* count each word (one or more word characters) */ }
"\n"           { printf("%d\n", i); i = 0; /* print and reset the count at each newline */ }
%%
int yywrap(void) { return 1; }   /* called by Lex when input is exhausted; return 1 if done, 0 if more input follows */
int main() {
    yylex();                     /* the function that starts the analysis */
    return 0;
}
Function yywrap is called by Lex when input is exhausted. Return 1 if you are done, or 0 if more processing is required. Every C program requires a main function; here we simply call yylex, which is the main entry point for Lex.
/* Lex program to identify and count positive and negative numbers */
%{
int positive_no = 0, negative_no = 0;
%}
%%
^[-][0-9]+   { negative_no++; printf("negative number = %s\n", yytext); }
[0-9]+       { positive_no++; printf("positive number = %s\n", yytext); }
%%
/*** user code section ***/
int yywrap() { return 1; }
int main() {
    yylex();
    printf("number of positive numbers = %d, number of negative numbers = %d\n",
           positive_no, negative_no);
    return 0;
}
yytext holds the text matched by the current token.
Regular expression to denote any string starting with a, where Σ = {a, b}: a(a|b)*. Regular expression to denote any string ending with a, where Σ = {a, b}: (a|b)*a. Regular expression to denote any string beginning with a and ending with b, where Σ = {a, b}: a(a|b)*b.
All strings of lowercase letters: [a-z]*. All strings of lowercase letters in which the letters are in ascending lexicographic order: a*b*c*d*...z*. All strings of a's and b's with an even number of a's: b*(ab*ab*)*.
All strings of lowercase letters that contain the five vowels in order: with cons → [b-d f-h j-n p-t v-z] (the consonants), the answer is cons* a (cons|a)* e (cons|e)* i (cons|i)* o (cons|o)* u (cons|u)*. Describe the languages denoted by the following regular expressions: (i) (a|b)*a(a|b)(a|b)  (ii) a*ba*ba*ba*.
With a suitable transition diagram, explain recognition of keywords and identifiers.